Final Project Summary Paper

This summary paper continues investigating the same topic as our midterm project – the relationship between mental health and economic inequality. The results of our midterm project indicated that there was a relationship between income inequality and mental health, which varied geographically across regions of the U.S. In this paper, we used statistical models to better understand the strength of the relationship.

Specifically, we asked the following SMART questions:

  • Is the relationship between mental health and income inequality across U.S. counties robust when including other economic variables?
  • Does the relationship differ across regions of the U.S.?
  • Does the relationship depend on the measure used for mental health?

Our independent variable for measuring income inequality is the “ratio of household income at the 80th percentile to income at the 20th percentile.” We have two alternative measures of mental health that can be used as dependent variables: the rate of frequent mental distress and the average number of poor mental health days reported in the past 30 days. We also plan to consider additional variables that are expected to affect mental health, including median household income, the percentage of households with severe housing problems, the unemployment rate, the child poverty rate, and county-level demographic information.

Because the measures of mental health are continuous variables, we chose a linear regression model to assess the strength of the relationship. Additionally, we also used regression tree and random forest models to determine variable importance and better predict the prevalence of mental health issues.

Our paper is structured as follows. First, we open by describing our data set and summarizing the exploratory data analysis (EDA) that informed our statistical modeling. Next, we carried out feature selection for the linear regression models to help us ascertain which variables to use in our models. Then, we evaluated a series of linear regression models, using both dependent variables, and compared the results. Finally, we built regression tree models and random forests as a secondary method of supervised learning.

Our results suggest that mental health and economic inequality exhibit statistically significant correlations. Additionally, random forests were determined to be the best predicting models, and the variables ranked the consistently highest in feature importance include year, region, median_inc, and child_poverty. Finally, there are many limitations that negatively impact the reliability of this project’s analysis including the use of self-reported mental health data, the strong influence of the year variable on the models, our primary focus on the adjusted R-squared when comparing linear models, and tree models that could use additional tuning to produce accurate and precise predictions of the target variables.

Description of the Dataset

The primary dataset used for this project is a combination of annual datasets called the County Health Rankings and Roadmaps National Data and was obtained the Robert Wood Johnson Foundation (RWJF). The County Health Rankings and Roadmaps National Data consists of county-level socioeconomic and public health data. The reports published from 2016 to 2021 provided the most complete data for all the variables of interest. The annual datasets before 2016 were incomplete or did not have variables that exactly corresponded to variables in 2016 to 2021. In general, the years in the datasets reflect the year the report was released by RWJF, but the underlying source data are from a one or two calendar years earlier.

For the exploratory data analysis, the data was also separated into four regions defined by the U.S. Census Bureau.

The variables in the data set are:

  • region: name of the US Census Bureau region (name)
  • division: name of the US Census Bureau division (contained with a census region)
  • state: two letter state abbreviation
  • statecode: FIPS state code
  • countycode: FIPS county code
  • fipscode: 5-digit FIPS Code (county-level); combines statecode and countycode
  • county: county name
  • year: report release year from County Health Rankings; range of 2016-2021
  • county_ranked: Indicates whether or not the county was ranked; 0=unranked, 1=ranked, or NA for aggregated national or state-level data
  • mental_health_days: Average number of mentally unhealthy days reported in past 30 days (age-adjusted)
  • mental_distress_rate: Percentage of adults reporting 14 or more days of poor mental health per month
  • inequality: Ratio of household income at the 80th percentile to income at the 20th percentile (Income inequality)
  • median_inc: The income where half of households in a county earn more and half of households earn less
  • hs_grad: Percentage of adults ages 25 and over with a high school diploma or equivalent
  • college: Percentage of adults ages 25-44 with some post-secondary education
  • unempl: Percentage of population ages 16 and older unemployed but seeking work
  • child_poverty: Percentage of people under age 18 in poverty
  • single_parent: Percentage of children that live in a household headed by single parent
  • severe_housing: Percentage of households with severe housing problems
  • food_index: Index of factors that contribute to a healthy food environment, from 0 (worst) to 10 (best)
  • mh_providers: rate of providers to 100,000 population
  • pop_provider_ratio: ratio of population to mental health providers (i.e., population served per provider)
  • pop: census population estimate
  • pct_below18: percent of population younger than 18
  • pct_black: percent of population that are African-American or non-Hispanic Black
  • pct_native_am: percent of population that are Native American or Alaska Natives
  • pct_asian: percent of population that are Asian
  • pct_pacific: percent of population that are Native Hawaiian or Other Pacific Islander
  • pct_hispanic: percent of population that are Hispanic
  • pct_white: percent of population that are non-Hispanic white or Caucasian
  • pct_female: percent of population that are female
  • pct_rural: percent of population that live in rural areas
# look at county_ranked var; not all counties are ranked; also some aggregated data per state and country exist in the observations
# =1 means they are ranked, =0 means unranked, and =NA is for state/national data
#print(summary(dframe$county_ranked))

# subset of dataframe including only ranked counties
ranked <- dframe %>% subset(county_ranked==1)

# subset of dataframe including only ranked counties
unranked <- dframe %>% subset(county_ranked==0)

# subset of dataframe including only aggregated data
aggregated <- dframe %>% subset(is.na(county_ranked))

# duplicate column and rename level labels for easier reading
ranked$region_abb <- ranked$region
levels(ranked$region_abb) <- c("", 
                              "MW",  # re-level factor labels
                              "NE",
                              "S", 
                              "W")

# subset ranked data by region
ranked_MW <- ranked %>% subset(region=="Midwest")
ranked_NE <- ranked %>% subset(region=="Northeast")
ranked_SO <- ranked %>% subset(region=="South")
ranked_WE <- ranked %>% subset(region=="West")

# sort dataframe
ranked <- ranked[order(ranked$year, ranked$region, ranked$division, ranked$statecode, ranked$countycode), ]

Our data are identified at the county-year level. Our data set contains 19164 observations and 32 variables, although 319 of these observations are for aggregated data at the national or state level. In addition, 375 observations are for counties that are unranked by RWJF, suggesting that the data for these counties is less reliable. In total, we have 18470 observations in the ranked data, combined across 6 annual reports (2016–2021).

Exploratory Data Analysis

Descriptive Statistics

#Summary Statistics of all Variables in dataset
table1<-describeBy(ranked, type = 1) ## type of kurtosis and skewness to calculate
table1 %>%
  kbl(caption="Summary Statistics for Master Dataset",
       format= "html", col.names = c("Var Num.","Count","Mean","Std. Dev.","Median", "Trimmed Mean", "Mad", "Minimum", "Maximum","Range", "Skewness", "Kurtosis", "S.E."),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Summary Statistics for Master Dataset
Var Num. Count Mean Std. Dev. Median Trimmed Mean Mad Minimum Maximum Range Skewness Kurtosis S.E.
region* 1 18470 3.40e+00 1.09e+00 4.00e+00 3.37e+00 1.48e+00 2.000 5.00e+00 3.00e+00 -0.198 -1.396 0.008
division* 2 18470 6.50e+00 2.88e+00 8.00e+00 6.63e+00 2.96e+00 2.000 1.00e+01 8.00e+00 -0.379 -1.413 0.021
state* 3 18470 2.70e+01 1.42e+01 2.60e+01 2.73e+01 1.78e+01 1.000 5.10e+01 5.00e+01 -0.028 -1.205 0.105
statecode* 4 18470 2.82e+01 1.43e+01 2.70e+01 2.85e+01 1.78e+01 2.000 5.20e+01 5.00e+01 -0.049 -1.214 0.105
countycode* 5 18470 6.55e+01 5.82e+01 5.30e+01 5.61e+01 4.45e+01 2.000 3.29e+02 3.27e+02 1.949 4.759 0.428
fipscode* 6 18469 1.60e+03 9.23e+02 1.59e+03 1.60e+03 1.18e+03 3.000 3.20e+03 3.20e+03 0.010 -1.204 6.793
county* 7 18470 9.43e+02 5.31e+02 9.45e+02 9.40e+02 6.61e+02 1.000 1.89e+03 1.89e+03 0.046 -1.123 3.908
year* 8 18470 3.50e+00 1.71e+00 4.00e+00 3.50e+00 1.48e+00 1.000 6.00e+00 5.00e+00 -0.002 -1.268 0.013
county_ranked* 9 18470 2.00e+00 0.00e+00 2.00e+00 2.00e+00 0.00e+00 2.000 2.00e+00 0.00e+00 NaN NaN 0.000
mental_health_days 10 18469 4.04e+00 6.92e-01 4.02e+00 4.03e+00 7.06e-01 2.100 7.29e+00 5.19e+00 0.223 -0.112 0.005
mental_distress_rate 11 18469 1.26e-01 2.40e-02 1.24e-01 1.25e-01 2.40e-02 0.066 2.47e-01 1.81e-01 0.532 0.302 0.000
inequality 12 18465 4.52e+00 7.33e-01 4.41e+00 4.45e+00 6.28e-01 2.543 1.20e+01 9.43e+00 1.254 3.433 0.005
median_inc 13 18469 5.08e+04 1.36e+04 4.87e+04 4.93e+04 1.10e+04 21658.000 1.52e+05 1.30e+05 1.424 3.681 99.952
hs_grad 14 16563 8.70e-01 7.90e-02 8.83e-01 8.78e-01 6.70e-02 0.025 1.00e+00 9.75e-01 -1.593 6.583 0.001
college 15 18469 5.71e-01 1.16e-01 5.72e-01 5.72e-01 1.22e-01 0.152 9.11e-01 7.59e-01 -0.088 -0.263 0.001
unempl 16 18469 5.00e-02 2.00e-02 4.60e-02 4.80e-02 1.70e-02 0.012 2.40e-01 2.28e-01 1.690 6.492 0.000
child_poverty 17 18469 2.21e-01 9.10e-02 2.09e-01 2.14e-01 9.00e-02 0.024 7.47e-01 7.23e-01 0.695 0.537 0.001
single_parent 18 18469 3.14e-01 1.06e-01 3.06e-01 3.08e-01 9.60e-02 0.000 8.72e-01 8.72e-01 0.696 1.251 0.001
severe_housing 19 18470 1.42e-01 4.70e-02 1.37e-01 1.39e-01 3.70e-02 0.022 7.13e-01 6.91e-01 2.086 14.166 0.000
food_index 20 18394 7.32e+00 1.18e+00 7.50e+00 7.44e+00 1.04e+00 0.000 1.00e+01 1.00e+01 -1.386 3.743 0.009
mh_providers 21 17028 1.00e-03 2.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 2.40e-02 2.40e-02 3.457 24.730 0.000
pop_provider_ratio 22 17028 2.00e+03 2.84e+03 9.90e+02 1.38e+03 8.97e+02 -957.000 5.49e+04 5.58e+04 4.291 37.897 21.797
pop 23 18469 1.05e+05 3.34e+05 2.67e+04 4.40e+04 2.79e+04 810.000 1.02e+07 1.02e+07 13.652 310.896 2457.693
pct_below18 24 18469 2.23e-01 3.40e-02 2.22e-01 2.22e-01 2.80e-02 0.051 4.20e-01 3.69e-01 0.500 2.410 0.000
pct_black 25 18469 9.10e-02 1.44e-01 2.30e-02 5.70e-02 2.80e-02 0.000 8.59e-01 8.59e-01 2.265 5.074 0.001
pct_native_am 26 18469 2.30e-02 7.70e-02 6.00e-03 8.00e-03 5.00e-03 0.000 9.31e-01 9.31e-01 7.643 66.364 0.001
pct_asian 27 18469 1.50e-02 2.80e-02 7.00e-03 9.00e-03 5.00e-03 0.000 4.30e-01 4.30e-01 6.880 66.735 0.000
pct_pacific 28 18469 1.00e-03 4.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 1.31e-01 1.31e-01 21.327 545.017 0.000
pct_hispanic 29 18469 9.40e-02 1.37e-01 4.20e-02 6.10e-02 3.60e-02 0.004 9.64e-01 9.59e-01 3.100 11.051 0.001
pct_white 30 18469 7.63e-01 2.01e-01 8.36e-01 7.94e-01 1.60e-01 0.027 9.81e-01 9.55e-01 -1.185 0.796 0.001
pct_female 31 18469 4.99e-01 2.20e-02 5.03e-01 5.02e-01 1.10e-02 0.265 5.70e-01 3.05e-01 -3.180 17.262 0.000
pct_rural 32 18447 5.78e-01 3.12e-01 5.87e-01 5.90e-01 3.82e-01 0.000 1.00e+00 1.00e+00 -0.131 -1.121 0.002
region_abb* 33 18470 3.40e+00 1.09e+00 4.00e+00 3.37e+00 1.48e+00 2.000 5.00e+00 3.00e+00 -0.198 -1.396 0.008

The summary statistics show that we have nearly complete data for our most relevant quantitative variables – namely, the measures of mental health, income inequality, and median household income. Several other important economic variables, such as child_poverty, single_parent, and severe_housing, have nearly complete data too.

Our intended dependent variables, mental_health_days and mental_distress_rate, each have skewness and kurtosis with an absolute value below 1, which indicates those variables have relatively normal distributions. The mean and median for mentally unhealthy days in the past 30 days are similar at 4.04 and 4.017, with a standard deviation of 0.692. The mean mental_distress_rate is 0.126 with a standard deviation of 0.024. The median is slightly below the mean.

The economic variables, particularly inequality, have greater skewness and kurtosis, suggesting their distributions are less symmetrical and have larger tails. inequality has a mean of 4.524, a median of 4.413, and a standard deviation of 0.733.

Scatterplots

We also plotted the mental health measures vs. inequality by region. The results are striking. The Midwest and the South both exhibit a moderate correlation between the variables. The West has a weaker, but still positive, correlation. And in the Northeast, mental health and inequality do not seem to be correlated much at all.

# data: use ranked_MW, ranked_NE, ranked_SO, ranked_WE

rgb_colors <- c("#A27BB8", "#006994", "#B52E1F", "#00873E")

#-- mental_health_days --#

p1 <- ggplot(ranked_MW, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[1]) + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = paste("(a) Midwest"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p2 <- ggplot(ranked_NE, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[2]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(b) Northeast"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p3 <- ggplot(ranked_SO, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[3]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(c) South"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p4 <- ggplot(ranked_WE, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[4]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(d) West"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p_regions_1 <- (p1 + p2)/(p3 + p4) + plot_annotation(title = "Scatterplot of Poor Mental Health Days vs. Income Inequality by Region")
p_regions_1

#-- mental_distress_rate --#

p1 <- ggplot(ranked_MW, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[1]) + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = paste("(a) Midwest"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p2 <- ggplot(ranked_NE, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[2]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(b) Northeast"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p3 <- ggplot(ranked_SO, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[3]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(c) South"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p4 <- ggplot(ranked_WE, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[4]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(d) West"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p_regions_2 <- (p1 + p2)/(p3 + p4) + plot_annotation(title = "Scatterplot of Frequent Mental Distress Rate vs. Income Inequality by Region")
p_regions_2

At this point in the EDA, it became clear that geographic differences in the relationship between mental health and inequality were important. These results justified our decision to focus on the differences by region.

Boxplots

To better show the relationship between the economic variables and mental health in different regions, we generated boxplots of the data by region.

# Characterization of median household income of four regions
b1 <- ggplot(ranked, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Median Income", x="region")

# Characterization of Income inequality of four regions
b2 <- ggplot(ranked, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Inequality", x="region")

# Characterization of poor mental health days of four regions
b3 <- ggplot(ranked, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Poor Mental Health Days", x="region")

# Characterization of the Frequent mental distress rate of four regions
b4 <- ggplot(ranked, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Frequent Mental Distress", x="region")

boxplot_a <- grid.arrange(b1,b2,b3,b4, nrow=2, ncol=2) #, top = text_grob("Boxplots of Key Variables by Region", color = "black", face = "bold", size = 14))

Overall, what information can we conclude from these boxplots?

  1. The Northeast has the highest median household income and the South has the lowest.
  2. Income inequality is highest in the South.
  3. Residents in the South also have the worst mental health, which correlates with their relatively low household income and higher levels of inequality.

As a whole, there seems to be real differences across regions in these key variables. The boxplots also indicate that these variables have outliers. As a result, we should look into the normality of the data before modeling.

Normality

We also graphed histograms of many of the numeric variables to consider their distributions. Although none are perfectly “normal” distributions, many do seem to approximate normality. Most importantly, mental_distress_rate and mental_health_days have distributions that somewhat resemble normality. Generally, the economic variables – such as inequality, median_inc, and unempl – are right-skewed.

# variables to plot
keep_var <- c('inequality', 'unempl', 'child_poverty', 'hs_grad', 'college', 'single_parent', 'severe_housing', 'food_index', 'mental_health_days', 'mental_distress_rate', 'median_inc', 'pop_provider_ratio')

# plot histograms
ranked[keep_var] %>% 
  gather() %>% 
  ggplot(aes(x=value)) + 
    facet_wrap(~ key, scales = "free") + 
    geom_histogram(bins=50, color = "#033C5A") +
  labs(title = "Histograms for Select Numeric Variables")

To complement the histograms, we also graphed Q-Q plots to assess the normality of our data. These plots compare the theoretical and sample distributions for a variable at specific quantiles. The line on each graph estimates what the variable would look like with a normal distribution.

In general, the mental health variables do not diverge too far from normality. Also, inequality and median_inc tightly follow a normal distribution until the upper range of their values, reflecting the greater spread that exists at the high end of the distribution.

# plot qq-norm plots
ranked[keep_var] %>% 
  gather() %>% 
  ggplot(aes(sample=value), na.rm=T) + 
    facet_wrap(~ key, scales = "free") + 
    stat_qq(color = "#033C5A") + stat_qq_line() + labs(title = "Q-Q Plots for Select Numeric Variables", y = "Sample Quantiles", x = "Theoretical Quantiles")

Despite not having a perfectly normal distribution, the sample quantiles remain relatively close to the theoretical quantiles for many of our variables. inequality and median_inc have right-tailed distributions, but without very thick tails. Ultimately, even if none of our variables are perfectly “normal,” few would seem to pose substantial issues during modeling.

Correlation Matrix

We also assessed the correlation among the numeric variables. This helps us establish whether variables are positively or negatively associated with each other, as well as the strength of that relationship.

# Correlation Matrices for numeric data

ranked_numeric <- subset(ranked, select = c("mental_health_days", "mental_distress_rate", "inequality", "median_inc", "hs_grad", "college", "unempl", "child_poverty","single_parent", "severe_housing", "food_index","mh_providers","pop_provider_ratio"))

a <- as.matrix(ranked_numeric)

b <- cor(a, use = "na.or.complete")

#corr_numbers <- corrplot(b, is.corr=TRUE, method="number", title="Correlation Matrix for Numeric Vars.",mar=c(0,0,1,0))

corr_numbers <- corrplot(b, is.corr=TRUE, title="Correlation Matrix for Numeric Vars.",mar=c(0,0,1,0))

In general, the results were not surprising. The two mental health variables are very positively correlated, and both are also positively correlated with inequality. In addition, inequality exhibited the strongest positive association with child_poverty and single_parent, and it had the strongest negative correlation with food_index and median_inc. These results make sense (remember that a higher score on the food index indicates a better food environment).

Feature Selection for Linear Models

Given the large set of variables that we have at our disposal. We used feature selection to help us identify key elements of the data to use in our models. We engaged in this exercise for both dependent variables: mental_health_days and mental_distress_rate.

#rankedfsf = ranked[ -c(1:9) ]
rankedfs = ranked [c(10:33)] 
head(rankedfs)

Target Variable: mental_health_days

1. Perform Linear Regression with All Predictors

Before selecting the best subset of predictors for our regression, we ran a simple linear regression on our dataset with all predictors to set the base adjusted R2 for comparison.

We performed linear regression with all predictors having mental health days as a target variable and excluding mental distress rate. And we are doing this to know whether these two mental health variables have similar relationships with the other predictors or not. That is why we are developing models separately.

#Before selecting the best subset of predictors for our regression, let’s run a simple linear regression on our dataset with all predictors to set the base adjusted r² for comparison.

lm1 <- lm(rankedfs,formula=mental_health_days ~. -mental_distress_rate)
summ(lm1)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(24,15703) 617.01
0.49
Adj. R² 0.48
Est. S.E. t val. p
(Intercept) 7.14 0.45 15.84 0.00
inequality 0.15 0.01 19.87 0.00
median_inc 0.00 0.00 3.54 0.00
hs_grad 0.61 0.06 10.26 0.00
college -2.68 0.06 -45.71 0.00
unempl -4.32 0.27 -16.16 0.00
child_poverty 2.88 0.12 24.73 0.00
single_parent -1.71 0.07 -25.28 0.00
severe_housing -0.42 0.13 -3.35 0.00
food_index -0.04 0.01 -6.08 0.00
mh_providers 36.22 3.09 11.71 0.00
pop_provider_ratio -0.00 0.00 -10.10 0.00
pop 0.00 0.00 5.36 0.00
pct_below18 -1.99 0.15 -13.13 0.00
pct_black -5.04 0.45 -11.30 0.00
pct_native_am -4.20 0.47 -8.98 0.00
pct_asian -7.42 0.51 -14.50 0.00
pct_pacific -9.93 1.49 -6.67 0.00
pct_hispanic -5.86 0.43 -13.59 0.00
pct_white -5.08 0.45 -11.32 0.00
pct_female 6.28 0.23 27.83 0.00
pct_rural -0.22 0.02 -10.86 0.00
region_abbNE 0.16 0.02 9.44 0.00
region_abbS 0.18 0.01 15.30 0.00
region_abbW 0.16 0.02 10.16 0.00
Standard errors: OLS

With all of our variables included in the model, we can see that the base adjusted R2 is 0.485 and the residual standard error is 0.479.

# which all feature selection methods we can perform:

rankedfs <- na.omit(rankedfs)
regfit.full = regsubsets(mental_health_days ~.-mental_distress_rate,rankedfs, nvmax=23)
reg.summary <- summary(regfit.full)
names(reg.summary)
# Feature Selection: mental_health_days
reg.best10 <- regsubsets(mental_health_days~. - mental_distress_rate , data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
                             
reg.best10                   
## Subset selection object
## Call: regsubsets.formula(mental_health_days ~ . - mental_distress_rate, 
##     data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
## 25 Variables  (and intercept)
##                    Forced in Forced out
## inequality             FALSE      FALSE
## median_inc             FALSE      FALSE
## hs_grad                FALSE      FALSE
## college                FALSE      FALSE
## unempl                 FALSE      FALSE
## child_poverty          FALSE      FALSE
## single_parent          FALSE      FALSE
## severe_housing         FALSE      FALSE
## food_index             FALSE      FALSE
## mh_providers           FALSE      FALSE
## pop_provider_ratio     FALSE      FALSE
## pop                    FALSE      FALSE
## pct_below18            FALSE      FALSE
## pct_black              FALSE      FALSE
## pct_native_am          FALSE      FALSE
## pct_asian              FALSE      FALSE
## pct_pacific            FALSE      FALSE
## pct_hispanic           FALSE      FALSE
## pct_white              FALSE      FALSE
## pct_female             FALSE      FALSE
## pct_rural              FALSE      FALSE
## region_abbMW           FALSE      FALSE
## region_abbNE           FALSE      FALSE
## region_abbS            FALSE      FALSE
## region_abbW            FALSE      FALSE
## 1 subsets of each size up to 24
## Selection Algorithm: exhaustive
# Best model at each variable number
summary.out <- summary(reg.best10)
as.data.frame(summary.out$outmat)
##           inequality median_inc hs_grad college unempl child_poverty
## 1  ( 1 )                                                           *
## 2  ( 1 )           *                          *                     
## 3  ( 1 )           *                          *                     
## 4  ( 1 )           *                          *                     
## 5  ( 1 )           *                          *                     
## 6  ( 1 )           *                          *                    *
## 7  ( 1 )           *                          *                    *
## 8  ( 1 )           *                          *                    *
## 9  ( 1 )           *                          *      *             *
## 10  ( 1 )          *                          *      *             *
## 11  ( 1 )          *                  *       *      *             *
## 12  ( 1 )          *                          *      *             *
## 13  ( 1 )          *                  *       *      *             *
## 14  ( 1 )          *                  *       *      *             *
## 15  ( 1 )          *                  *       *      *             *
## 16  ( 1 )          *                  *       *      *             *
## 17  ( 1 )          *                  *       *      *             *
## 18  ( 1 )          *                  *       *      *             *
## 19  ( 1 )          *                  *       *      *             *
## 20  ( 1 )          *                  *       *      *             *
## 21  ( 1 )          *                  *       *      *             *
## 22  ( 1 )          *          *       *       *      *             *
## 23  ( 1 )          *          *       *       *      *             *
## 24  ( 1 )          *          *       *       *      *             *
##           single_parent severe_housing food_index mh_providers
## 1  ( 1 )                                                      
## 2  ( 1 )                                                      
## 3  ( 1 )                                                      
## 4  ( 1 )                                                      
## 5  ( 1 )                                                     *
## 6  ( 1 )              *                                       
## 7  ( 1 )              *                                      *
## 8  ( 1 )              *                                      *
## 9  ( 1 )              *                                      *
## 10  ( 1 )             *                                      *
## 11  ( 1 )             *                                      *
## 12  ( 1 )             *                                      *
## 13  ( 1 )             *                                      *
## 14  ( 1 )             *                                      *
## 15  ( 1 )             *                                      *
## 16  ( 1 )             *                                      *
## 17  ( 1 )             *                                      *
## 18  ( 1 )             *                                      *
## 19  ( 1 )             *                         *            *
## 20  ( 1 )             *                         *            *
## 21  ( 1 )             *              *          *            *
## 22  ( 1 )             *              *          *            *
## 23  ( 1 )             *              *          *            *
## 24  ( 1 )             *              *          *            *
##           pop_provider_ratio pop pct_below18 pct_black pct_native_am pct_asian
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                                                                      
## 4  ( 1 )                                                                      
## 5  ( 1 )                                                                      
## 6  ( 1 )                                                                      
## 7  ( 1 )                                                                      
## 8  ( 1 )                                                                      
## 9  ( 1 )                                                                      
## 10  ( 1 )                  *                                                  
## 11  ( 1 )                  *                                                  
## 12  ( 1 )                                  *                       *          
## 13  ( 1 )                                  *                       *          
## 14  ( 1 )                                  *                       *         *
## 15  ( 1 )                  *               *                       *         *
## 16  ( 1 )                  *               *         *                       *
## 17  ( 1 )                  *               *         *             *         *
## 18  ( 1 )                  *               *         *             *         *
## 19  ( 1 )                  *               *         *             *         *
## 20  ( 1 )                  *   *           *         *             *         *
## 21  ( 1 )                  *   *           *         *             *         *
## 22  ( 1 )                  *   *           *         *             *         *
## 23  ( 1 )                  *   *           *         *             *         *
## 24  ( 1 )                  *   *           *         *             *         *
##           pct_pacific pct_hispanic pct_white pct_female pct_rural region_abbMW
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                         *                                            
## 4  ( 1 )                         *                    *                       
## 5  ( 1 )                         *                    *                       
## 6  ( 1 )                         *                    *                       
## 7  ( 1 )                         *                    *                       
## 8  ( 1 )                         *                    *                      *
## 9  ( 1 )                         *                    *                      *
## 10  ( 1 )                        *                    *                      *
## 11  ( 1 )                        *                    *                      *
## 12  ( 1 )                        *                    *         *            *
## 13  ( 1 )                        *                    *         *            *
## 14  ( 1 )                        *                    *         *            *
## 15  ( 1 )                        *                    *         *            *
## 16  ( 1 )                        *         *          *         *            *
## 17  ( 1 )                        *         *          *         *            *
## 18  ( 1 )           *            *         *          *         *            *
## 19  ( 1 )           *            *         *          *         *            *
## 20  ( 1 )           *            *         *          *         *            *
## 21  ( 1 )           *            *         *          *         *            *
## 22  ( 1 )           *            *         *          *         *            *
## 23  ( 1 )           *            *         *          *         *            *
## 24  ( 1 )           *            *         *          *         *            *
##           region_abbNE region_abbS region_abbW
## 1  ( 1 )                                      
## 2  ( 1 )                                      
## 3  ( 1 )                                      
## 4  ( 1 )                                      
## 5  ( 1 )                                      
## 6  ( 1 )                                      
## 7  ( 1 )                                      
## 8  ( 1 )                                      
## 9  ( 1 )                                      
## 10  ( 1 )                                     
## 11  ( 1 )                                     
## 12  ( 1 )                                     
## 13  ( 1 )                                     
## 14  ( 1 )                                     
## 15  ( 1 )                                     
## 16  ( 1 )                                     
## 17  ( 1 )                                     
## 18  ( 1 )                                     
## 19  ( 1 )                                     
## 20  ( 1 )                                     
## 21  ( 1 )                                     
## 22  ( 1 )                                     
## 23  ( 1 )                        *            
## 24  ( 1 )            *           *

The best model in the 10-variable case includes all variables, as that is the only way to have 10 variables.

plot(reg.best10, scale = "adjr2", main = "Adjusted R^2")

plot(reg.best10, scale = "r2", main = "R^2")

plot(reg.best10, scale = "bic", main = "BIC")

plot(reg.best10, scale = "Cp", main = "Cp")

coef(reg.best10, 10, scale = "adjr2") # default BIC
##        (Intercept)         inequality            college             unempl 
##           2.20e+00           1.57e-01          -2.51e+00          -4.49e+00 
##      child_poverty      single_parent       mh_providers pop_provider_ratio 
##           2.80e+00          -1.55e+00           4.59e+01          -1.95e-05 
##       pct_hispanic         pct_female       region_abbMW 
##          -1.12e+00           5.68e+00          -1.91e-01

The summary table below provides details on which predictors to use for the model. The best predictors are indicated by the word “true.” And then we selected the best 10 features with the method adjusted R2 to build the model with the selected features.

summary.out$which[10,]
   (Intercept)         inequality         median_inc            hs_grad 
          TRUE               TRUE              FALSE              FALSE 
       college             unempl      child_poverty      single_parent 
          TRUE               TRUE               TRUE               TRUE 
severe_housing         food_index       mh_providers pop_provider_ratio 
         FALSE              FALSE               TRUE               TRUE 
           pop        pct_below18          pct_black      pct_native_am 
         FALSE              FALSE              FALSE              FALSE 
     pct_asian        pct_pacific       pct_hispanic          pct_white 
         FALSE              FALSE               TRUE              FALSE 
    pct_female          pct_rural       region_abbMW       region_abbNE 
          TRUE              FALSE               TRUE              FALSE 
   region_abbS        region_abbW 
         FALSE              FALSE 

2. Perform Linear Regression with selected best 10 variables after feature selection

lm2 <- lm(rankedfs,formula=mental_health_days ~ child_poverty + inequality + unempl + college + mh_providers + pop_provider_ratio + region_abb + single_parent + pct_hispanic + pct_female)
summ(lm2, vifs=T)
Observations 15728
Dependent variable mental_health_days
Type OLS linear regression
F(12,15715) 1120.94
0.46
Adj. R² 0.46
Est. S.E. t val. p VIF
(Intercept) 2.12 0.10 20.72 0.00 NA
child_poverty 2.73 0.09 29.40 0.00 4.47
inequality 0.15 0.01 20.79 0.00 1.80
unempl -4.09 0.26 -15.50 0.00 1.68
college -2.47 0.05 -45.05 0.00 2.52
mh_providers 50.86 3.00 16.98 0.00 1.47
pop_provider_ratio -0.00 0.00 -13.25 0.00 1.29
region_abbNE 0.16 0.02 10.02 0.00 1.99
region_abbS 0.22 0.01 20.70 0.00 1.99
region_abbW 0.13 0.01 9.17 0.00 1.99
single_parent -1.57 0.06 -27.81 0.00 2.11
pct_hispanic -1.08 0.03 -33.96 0.00 1.16
pct_female 5.45 0.22 25.23 0.00 1.25
Standard errors: OLS

In the model with best 10 selected features, the adjusted R2 has declined significantly to 0.461.

3. Plot Output from regsubsets Function in leaps package

res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")

## Mallow Cp
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 2)

#BIC
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="bic", legend = FALSE, min.size = 5, main = "BIC")
abline(a = 1, b = 1, lty = 2)

#r2
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="rsq", legend = FALSE, min.size = 5, main = "R squared")
abline(a = 1, b = 1, lty = 2)

res.legend
##                    Abbreviation
## inequality                    i
## median_inc                  md_
## hs_grad                       h
## college                      cl
## unempl                        u
## child_poverty                c_
## single_parent               sn_
## severe_housing              sv_
## food_index                    f
## mh_providers                mh_
## pop_provider_ratio         pp__
## pop                          pp
## pct_below18                 p_1
## pct_black                 pct_b
## pct_native_am              pc__
## pct_asian                 pct_s
## pct_pacific               pct_p
## pct_hispanic              pct_h
## pct_white                 pct_w
## pct_female                pct_f
## pct_rural                 pct_r
## region_abbMW                r_M
## region_abbNE                r_N
## region_abbS                 r_S
## region_abbW                 r_W
# See which model has the highest R Square :
days_maxR2 <- which.max(summary.out$rsq)
days_maxR2

As we can see, at 24 features, we are getting the highest R2 and we have plotted the graph below.

plot(summary.out$adjr2,xlab="number of variables", ylab="R Square", type="l")
points(days_maxR2, summary.out$rsq[days_maxR2], col='red', cex=2, pch=20)

# See which model has the highest adjusted R2 :
days_maxAdjR2 <- which.max(summary.out$adjr2)
days_maxAdjR2

As we can see, at 23 features, we are getting the maximum adjusted R2 and we have plotted the graph below.

plot(summary.out$adjr2,xlab="number of variables", ylab="adjr2", type="l")
points(days_maxAdjR2, summary.out$adjr2[days_maxAdjR2], col='red', cex=2, pch=20)

#See which model has the lowest CP :
days_minCP <- which.min(summary.out$cp)
days_minCP

As we can see, at 22 features, we are getting the lowest CP and we have plotted the graph below.

plot(summary.out$cp,xlab="number of variables", ylab="cp", type="l")
points(days_minCP, summary.out$cp[days_minCP], col='red', cex=2, pch=20)

#See which model has the lowest BIC :
days_minBIC <- which.min(summary.out$bic)
days_minBIC

As we can see, at 22 features, we are getting the minimum BIC.

plot(summary.out$bic,xlab="number of variables", ylab="BIC", type="l")
points(days_minBIC, summary.out$bic[days_minBIC], col='red', cex=2, pch=20)

#forward and backward selection is same as exhaustive just slight different in forward it start will no variable and iterate till the max variable and backward selection is otherway around.
#validation
set.seed(1)
train=sample(c(TRUE, FALSE), nrow(rankedfs), rep=T)
test = (!train)
regfit.best=regsubsets(mental_health_days~.-mental_distress_rate ,data=rankedfs[train,],nvmax = 23)
test.mat = model.matrix(mental_health_days~.-mental_distress_rate , data = rankedfs[test,])
val.errors = rep(NA, 23)
for(i in 1:23){
  coefi = coef(regfit.best, id=i)
   pred = test.mat[,names(coefi)]%*%coefi
   val.errors[i]= mean((rankedfs$mental_health_days[test]-pred)^2)
   }
val.errors
##  [1] 0.327 0.311 0.292 0.279 0.271 0.265 0.256 0.251 0.246 0.243 0.242 0.241
## [13] 0.238 0.237 0.235 0.234 0.234 0.232 0.232 0.232 0.232 0.231 0.231
which.min(val.errors)
## [1] 22
coef(regfit.best,23)
##        (Intercept)         inequality         median_inc            hs_grad 
##           6.52e+00           1.52e-01           2.05e-06           5.91e-01 
##            college             unempl      child_poverty      single_parent 
##          -2.69e+00          -3.96e+00           2.68e+00          -1.70e+00 
##     severe_housing         food_index       mh_providers pop_provider_ratio 
##          -3.53e-01          -4.29e-02           3.84e+01          -1.66e-05 
##                pop        pct_below18          pct_black      pct_native_am 
##           7.13e-08          -2.02e+00          -4.23e+00          -3.33e+00 
##          pct_asian        pct_pacific       pct_hispanic          pct_white 
##          -6.52e+00          -8.41e+00          -5.08e+00          -4.24e+00 
##         pct_female          pct_rural       region_abbMW        region_abbS 
##           6.32e+00          -1.90e-01          -1.54e-01           3.00e-02

Target Variable: mental_distress_rate

1.Perform Linear Regression with All Predictors

We performed linear regression with all predictors having mental distress rate as a target variable and excluding mental health days to set the base adjusted R2 for comparison.

#1.Perform Linear Regression with All Predictors
#Before selecting the best subset of predictors for our regression, let’s run a simple linear regression on our dataset with all predictors to set the base adjusted r² for comparison.

lm3 <- lm(rankedfs,formula=mental_distress_rate ~. -mental_health_days )
summ(lm3)
Observations 15728
Dependent variable mental_distress_rate
Type OLS linear regression
F(24,15703) 714.42
0.52
Adj. R² 0.52
Est. S.E. t val. p
(Intercept) 0.21 0.02 14.06 0.00
inequality 0.01 0.00 22.69 0.00
median_inc 0.00 0.00 2.84 0.00
hs_grad 0.02 0.00 10.56 0.00
college -0.09 0.00 -43.94 0.00
unempl -0.21 0.01 -23.33 0.00
child_poverty 0.11 0.00 28.67 0.00
single_parent -0.07 0.00 -31.66 0.00
severe_housing -0.01 0.00 -1.63 0.10
food_index -0.00 0.00 -8.23 0.00
mh_providers 1.11 0.10 10.75 0.00
pop_provider_ratio -0.00 0.00 -10.57 0.00
pop 0.00 0.00 2.64 0.01
pct_below18 -0.06 0.01 -12.08 0.00
pct_black -0.14 0.01 -9.14 0.00
pct_native_am -0.10 0.02 -6.31 0.00
pct_asian -0.23 0.02 -13.69 0.00
pct_pacific -0.25 0.05 -5.04 0.00
pct_hispanic -0.17 0.01 -11.64 0.00
pct_white -0.15 0.01 -10.12 0.00
pct_female 0.21 0.01 27.50 0.00
pct_rural -0.00 0.00 -6.72 0.00
region_abbNE 0.00 0.00 3.02 0.00
region_abbS 0.00 0.00 8.47 0.00
region_abbW 0.00 0.00 5.03 0.00
Standard errors: OLS

With all of our variables included in the model, we can see that the base adjusted R2 is 0.521 and the residual standard error is 0.0159.

# feature selection(mental distress rate)
reg2.best10 <- regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
                             
reg2.best10                   
summary2.out <- summary(reg2.best10)
as.data.frame(summary2.out$outmat)
##           inequality median_inc hs_grad college unempl child_poverty
## 1  ( 1 )                                                           *
## 2  ( 1 )                                      *                    *
## 3  ( 1 )           *                          *                     
## 4  ( 1 )           *                          *                     
## 5  ( 1 )           *                          *                    *
## 6  ( 1 )           *                          *                    *
## 7  ( 1 )           *                          *      *             *
## 8  ( 1 )           *                          *      *             *
## 9  ( 1 )           *                          *      *             *
## 10  ( 1 )          *                          *      *             *
## 11  ( 1 )          *                          *      *             *
## 12  ( 1 )          *                          *      *             *
## 13  ( 1 )          *                          *      *             *
## 14  ( 1 )          *                          *      *             *
## 15  ( 1 )          *                  *       *      *             *
## 16  ( 1 )          *                  *       *      *             *
## 17  ( 1 )          *                  *       *      *             *
## 18  ( 1 )          *                  *       *      *             *
## 19  ( 1 )          *                  *       *      *             *
## 20  ( 1 )          *          *       *       *      *             *
## 21  ( 1 )          *          *       *       *      *             *
## 22  ( 1 )          *          *       *       *      *             *
## 23  ( 1 )          *          *       *       *      *             *
## 24  ( 1 )          *          *       *       *      *             *
##           single_parent severe_housing food_index mh_providers
## 1  ( 1 )                                                      
## 2  ( 1 )                                                      
## 3  ( 1 )                                                      
## 4  ( 1 )                                                      
## 5  ( 1 )              *                                       
## 6  ( 1 )              *                                       
## 7  ( 1 )              *                                       
## 8  ( 1 )              *                                      *
## 9  ( 1 )              *                                      *
## 10  ( 1 )             *                                      *
## 11  ( 1 )             *                                      *
## 12  ( 1 )             *                         *            *
## 13  ( 1 )             *                         *            *
## 14  ( 1 )             *                                      *
## 15  ( 1 )             *                                      *
## 16  ( 1 )             *                                      *
## 17  ( 1 )             *                         *            *
## 18  ( 1 )             *                         *            *
## 19  ( 1 )             *                         *            *
## 20  ( 1 )             *                         *            *
## 21  ( 1 )             *                         *            *
## 22  ( 1 )             *                         *            *
## 23  ( 1 )             *              *          *            *
## 24  ( 1 )             *              *          *            *
##           pop_provider_ratio pop pct_below18 pct_black pct_native_am pct_asian
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                                                                      
## 4  ( 1 )                                                                      
## 5  ( 1 )                                                                      
## 6  ( 1 )                                                                      
## 7  ( 1 )                                                                      
## 8  ( 1 )                                                                      
## 9  ( 1 )                                                           *          
## 10  ( 1 )                                                          *          
## 11  ( 1 )                  *                                       *          
## 12  ( 1 )                  *                                       *          
## 13  ( 1 )                  *               *                       *          
## 14  ( 1 )                  *               *         *                       *
## 15  ( 1 )                  *               *         *                       *
## 16  ( 1 )                  *               *         *                       *
## 17  ( 1 )                  *               *         *                       *
## 18  ( 1 )                  *               *         *                       *
## 19  ( 1 )                  *               *         *             *         *
## 20  ( 1 )                  *               *         *             *         *
## 21  ( 1 )                  *               *         *             *         *
## 22  ( 1 )                  *   *           *         *             *         *
## 23  ( 1 )                  *   *           *         *             *         *
## 24  ( 1 )                  *   *           *         *             *         *
##           pct_pacific pct_hispanic pct_white pct_female pct_rural region_abbMW
## 1  ( 1 )                                                                      
## 2  ( 1 )                                                                      
## 3  ( 1 )                                              *                       
## 4  ( 1 )                         *                    *                       
## 5  ( 1 )                                              *                       
## 6  ( 1 )                         *                    *                       
## 7  ( 1 )                         *                    *                       
## 8  ( 1 )                         *                    *                       
## 9  ( 1 )                         *                    *                       
## 10  ( 1 )                        *                    *                       
## 11  ( 1 )                        *                    *                       
## 12  ( 1 )                        *                    *                       
## 13  ( 1 )                        *                    *                       
## 14  ( 1 )                        *         *          *                       
## 15  ( 1 )                        *         *          *                       
## 16  ( 1 )                        *         *          *         *            *
## 17  ( 1 )                        *         *          *         *            *
## 18  ( 1 )                        *         *          *         *            *
## 19  ( 1 )           *            *         *          *         *            *
## 20  ( 1 )           *            *         *          *         *            *
## 21  ( 1 )           *            *         *          *         *            *
## 22  ( 1 )           *            *         *          *         *            *
## 23  ( 1 )           *            *         *          *         *            *
## 24  ( 1 )           *            *         *          *         *            *
##           region_abbNE region_abbS region_abbW
## 1  ( 1 )                                      
## 2  ( 1 )                                      
## 3  ( 1 )                                      
## 4  ( 1 )                                      
## 5  ( 1 )                                      
## 6  ( 1 )                                      
## 7  ( 1 )                                      
## 8  ( 1 )                                      
## 9  ( 1 )                                      
## 10  ( 1 )                        *            
## 11  ( 1 )                        *            
## 12  ( 1 )                        *            
## 13  ( 1 )                        *            
## 14  ( 1 )                        *            
## 15  ( 1 )                        *            
## 16  ( 1 )                                     
## 17  ( 1 )                                     
## 18  ( 1 )                        *            
## 19  ( 1 )                                     
## 20  ( 1 )                                     
## 21  ( 1 )                        *            
## 22  ( 1 )                        *            
## 23  ( 1 )            *                        
## 24  ( 1 )            *           *
plot(reg2.best10, scale = "adjr2", main = "Adjusted R^2")

plot(reg2.best10, scale = "r2", main = "R^2")

plot(reg2.best10, scale = "bic", main = "BIC")

plot(reg2.best10, scale = "Cp", main = "Cp")

coef(reg2.best10, 10, scale = "adjr2") # default BIC
##   (Intercept)    inequality       college        unempl child_poverty 
##       0.05396       0.00597      -0.07621      -0.20307       0.11434 
## single_parent  mh_providers pct_native_am  pct_hispanic    pct_female 
##      -0.05768       1.86752       0.04196      -0.02817       0.17954 
##   region_abbS 
##       0.00491

The summary table below provides details on which predictors to use for the model. The best predictors are indicated by the word “true.” And then we selected the best 10 features with the method adjusted R2 to build the model.

summary2.out$which[10,]
   (Intercept)         inequality         median_inc            hs_grad 
          TRUE               TRUE              FALSE              FALSE 
       college             unempl      child_poverty      single_parent 
          TRUE               TRUE               TRUE               TRUE 
severe_housing         food_index       mh_providers pop_provider_ratio 
         FALSE              FALSE               TRUE              FALSE 
           pop        pct_below18          pct_black      pct_native_am 
         FALSE              FALSE              FALSE               TRUE 
     pct_asian        pct_pacific       pct_hispanic          pct_white 
         FALSE              FALSE               TRUE              FALSE 
    pct_female          pct_rural       region_abbMW       region_abbNE 
          TRUE              FALSE              FALSE              FALSE 
   region_abbS        region_abbW 
          TRUE              FALSE 

2.Perform Linear Regression with selected best 10 variables after feature selection

lm4 <- lm(rankedfs,formula=mental_distress_rate ~  inequality + college + mh_providers + pct_female + child_poverty  + unempl + single_parent + region_abb + pct_hispanic + pct_native_am)
summ(lm4)
Observations 15728
Dependent variable mental_distress_rate
Type OLS linear regression
F(12,15715) 1296.98
0.50
Adj. R² 0.50
Est. S.E. t val. p
(Intercept) 0.05 0.00 15.75 0.00
inequality 0.01 0.00 24.10 0.00
college -0.08 0.00 -41.58 0.00
mh_providers 1.74 0.10 18.27 0.00
pct_female 0.18 0.01 24.99 0.00
child_poverty 0.12 0.00 37.54 0.00
unempl -0.21 0.01 -23.80 0.00
single_parent -0.06 0.00 -30.43 0.00
region_abbNE 0.00 0.00 3.77 0.00
region_abbS 0.01 0.00 16.10 0.00
region_abbW 0.00 0.00 4.04 0.00
pct_hispanic -0.03 0.00 -27.70 0.00
pct_native_am 0.04 0.00 19.28 0.00
Standard errors: OLS

In model with best 10 selected feature, adjusted r2 has declined significantly to 0.497.

3. Plot Output from regsubsets Function in leaps package

res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")

## Mallow Cp
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 2)

#BIC
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="bic", legend = FALSE, min.size = 5, main = "BIC")
abline(a = 1, b = 1, lty = 2)

#r2
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="rsq", legend = FALSE, min.size = 5, main = "R square")
abline(a = 1, b = 1, lty = 2)

res.legend
##                    Abbreviation
## inequality                    i
## median_inc                  md_
## hs_grad                       h
## college                      cl
## unempl                        u
## child_poverty                c_
## single_parent               sn_
## severe_housing              sv_
## food_index                    f
## mh_providers                mh_
## pop_provider_ratio         pp__
## pop                          pp
## pct_below18                 p_1
## pct_black                 pct_b
## pct_native_am              pc__
## pct_asian                 pct_s
## pct_pacific               pct_p
## pct_hispanic              pct_h
## pct_white                 pct_w
## pct_female                pct_f
## pct_rural                 pct_r
## region_abbMW                r_M
## region_abbNE                r_N
## region_abbS                 r_S
## region_abbW                 r_W
#See which model has the highest R Square :
rate_maxR2 <- which.max(summary2.out$rsq)
rate_maxR2

As we can see, at 24 features, we are getting the highest R2 and we have plotted the graph below.

plot(summary2.out$adjr2,xlab="number of variables", ylab="R Squared", type="l")
points(rate_maxR2, summary2.out$rsq[rate_maxR2], col='red', cex=2, pch=20)

#See which model has the highest adjusted R2 :
rate_maxAdjR2 <- which.max(summary2.out$adjr2)
rate_maxAdjR2

As we can see, at 24 features, we are getting the highest adjusted R2 and we have plotted the graph below.

plot(summary2.out$adjr2,xlab="number of variables", ylab="adjr2", type="l")
points(rate_maxAdjR2, summary2.out$adjr2[rate_maxAdjR2], col='red', cex=2, pch=20)

#See which model has the lowest CP :
rate_minCP <- which.min(summary2.out$cp)
rate_minCP

As we can see, at 23 features, we are getting the lowest CP and we have plotted the graph below.

plot(summary2.out$cp,xlab="number of variables", ylab="cp", type="l")
points(rate_minCP, summary2.out$cp[rate_minCP], col='red', cex=2, pch=20)

#See which model has the lowest BIC :
rate_minBIC <- which.min(summary2.out$bic)
rate_minBIC
## [1] 19

As we can see, at 19 features, we are getting the lowest BIC and we have plotted the graph below.

plot(summary2.out$bic,xlab="number of variables", ylab="BIC", type="l")
points(rate_minBIC, summary2.out$bic[rate_minBIC], col='red', cex=2, pch=20)

#forward and backward selection is same as exhaustive just slight different in forward it start will no variable and iterate till the max variable and backward selection is otherway around.
#validation
set.seed(1)
train=sample(c(TRUE, FALSE), nrow(rankedfs), rep=T)
test = (!train)
regfit.best=regsubsets(mental_distress_rate~.-mental_health_days,data=rankedfs[train,],nvmax = 23)
test.mat = model.matrix(mental_distress_rate~.-mental_health_days, data = rankedfs[test,])
val.errors = rep(NA, 23)
for(i in 1:23){
  coefi = coef(regfit.best, id=i)
   pred = test.mat[,names(coefi)]%*%coefi
   val.errors[i]= mean((rankedfs$mental_health_days[test]-pred)^2)
   }
val.errors
##  [1] 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1
## [16] 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1
which.min(val.errors)
## [1] 13
coef(regfit.best,23)
##        (Intercept)         inequality         median_inc            hs_grad 
##           1.86e-01           5.57e-03           4.96e-08           2.15e-02 
##            college             unempl      child_poverty      single_parent 
##          -8.58e-02          -1.94e-01           1.04e-01          -7.10e-02 
##         food_index       mh_providers pop_provider_ratio                pop 
##          -1.93e-03           1.14e+00          -5.79e-07           1.13e-09 
##        pct_below18          pct_black      pct_native_am          pct_asian 
##          -6.20e-02          -1.06e-01          -6.75e-02          -1.97e-01 
##        pct_pacific       pct_hispanic          pct_white         pct_female 
##          -2.04e-01          -1.39e-01          -1.21e-01           2.06e-01 
##          pct_rural       region_abbMW       region_abbNE        region_abbS 
##          -3.46e-03          -2.69e-03          -1.19e-03           9.11e-04

Linear Models

After the feature selection exercises, we continued with analyzing our data using linear regression models. We assessed two sets of models with different dependent variables. The first set of models used mental_health_days as the dependent variable, while the second set used mental_distress_rate as the dependent variable. Within these sets of models, we compared the models that were produced with feature selection to several other potential combinations of variables. We also considered the variance inflation factors (VIFs) for models to address the issue of multicollinearity.

lm_days_fs <- lm2

lm_rate_fs <- lm4

Dependent variable: mental_health_days

We arranged several different combinations of variables to evaluate what different might make a suitable linear model with mental_health_days as the dependent variable. First, we regressed mental_health_days on the independent variable of interest – inequality – and a factor variable for region. Next, we included a factor variable for year to control for unobserved differences over time. Third, we integrated a vector of economic variables that also might relate to mental health outcomes, as well as a vector of demographic variables.

# v1: base specification
lm_days_1 <- lm(mental_health_days ~ inequality + region, data = ranked)
summ(lm_days_1, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(4,18460) 1341.28
0.23
Adj. R² 0.23
Est. S.E. t val. p VIF
(Intercept) 2.68 0.03 93.25 0.00 NA
inequality 0.25 0.01 37.33 0.00 1.17
regionNortheast 0.21 0.02 11.35 0.00 1.17
regionSouth 0.47 0.01 42.89 0.00 1.17
regionWest 0.12 0.01 8.40 0.00 1.17
Standard errors: OLS
# v2: include year as factor variable
lm_days_2 <- lm(mental_health_days ~ inequality + region + year, data = ranked)
summ(lm_days_2, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(9,18455) 1672.85
0.45
Adj. R² 0.45
Est. S.E. t val. p VIF
(Intercept) 2.32 0.03 90.60 0.00 NA
inequality 0.25 0.01 44.51 0.00 1.17
regionNortheast 0.21 0.02 13.44 0.00 1.17
regionSouth 0.47 0.01 50.79 0.00 1.17
regionWest 0.12 0.01 9.83 0.00 1.17
year2017 0.10 0.01 7.73 0.00 1.00
year2018 0.25 0.01 19.45 0.00 1.00
year2019 0.26 0.01 19.53 0.00 1.00
year2020 0.49 0.01 37.50 0.00 1.00
year2021 1.00 0.01 76.30 0.00 1.00
Standard errors: OLS
# v3: include economic and demographic vars
lm_days_3 <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + pop_provider_ratio
                + pop + pct_below18 + pct_female + pct_rural
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic + pct_white
                + region + year, data = ranked)
summ(lm_days_3, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(28,15699) 1598.75
0.74
Adj. R² 0.74
Est. S.E. t val. p VIF
(Intercept) 5.86 0.32 18.59 0.00 NA
inequality 0.07 0.01 13.49 0.00 2.01
college -1.64 0.04 -38.79 0.00 3.12
hs_grad 0.15 0.04 3.46 0.00 1.52
unempl 5.19 0.21 24.50 0.00 2.25
child_poverty 1.34 0.08 16.05 0.00 7.54
single_parent 0.39 0.05 7.26 0.00 3.90
severe_housing 0.39 0.09 4.30 0.00 2.17
food_index -0.04 0.00 -8.54 0.00 3.07
median_inc -0.00 0.00 -15.86 0.00 5.51
pop_provider_ratio -0.00 0.00 -6.53 0.00 1.28
pop 0.00 0.00 8.59 0.00 1.49
pct_below18 0.90 0.11 8.16 0.00 1.76
pct_female 3.39 0.16 20.90 0.00 1.47
pct_rural -0.22 0.01 -15.29 0.00 2.40
pct_black -4.78 0.31 -15.32 0.00 273.52
pct_native_am -4.13 0.33 -12.57 0.00 61.64
pct_asian -3.95 0.36 -10.93 0.00 14.94
pct_pacific -12.20 1.05 -11.67 0.00 3.04
pct_hispanic -4.97 0.30 -16.48 0.00 216.22
pct_white -3.65 0.31 -11.62 0.00 516.34
regionNortheast 0.25 0.01 20.91 0.00 3.44
regionSouth 0.32 0.01 38.78 0.00 3.44
regionWest 0.19 0.01 17.58 0.00 3.44
year2017 0.16 0.01 16.51 0.00 1.98
year2018 0.38 0.01 38.05 0.00 1.98
year2019 0.44 0.01 42.46 0.00 1.98
year2020 0.72 0.01 67.04 0.00 1.98
year2021 1.32 0.01 114.25 0.00 1.98
Standard errors: OLS
# v4: feature selection
summ(lm_days_fs, vifs = T)
Observations 15728
Dependent variable mental_health_days
Type OLS linear regression
F(12,15715) 1120.94
0.46
Adj. R² 0.46
Est. S.E. t val. p VIF
(Intercept) 2.12 0.10 20.72 0.00 NA
child_poverty 2.73 0.09 29.40 0.00 4.47
inequality 0.15 0.01 20.79 0.00 1.80
unempl -4.09 0.26 -15.50 0.00 1.68
college -2.47 0.05 -45.05 0.00 2.52
mh_providers 50.86 3.00 16.98 0.00 1.47
pop_provider_ratio -0.00 0.00 -13.25 0.00 1.29
region_abbNE 0.16 0.02 10.02 0.00 1.99
region_abbS 0.22 0.01 20.70 0.00 1.99
region_abbW 0.13 0.01 9.17 0.00 1.99
single_parent -1.57 0.06 -27.81 0.00 2.11
pct_hispanic -1.08 0.03 -33.96 0.00 1.16
pct_female 5.45 0.22 25.23 0.00 1.25
Standard errors: OLS

For the first (base) model, the equations for each region are as follows:

  • regionMidwest: mental_health_days = 2.676 + 0.247 * inequality
  • regionNortheast: mental_health_days = 2.676 + 0.212 + 0.247 * inequality
  • regionSouth: mental_health_days = 2.676 + 0.472 + 0.247 * inequality
  • regionWest: mental_health_days = 2.676 + 0.121 + 0.247 * inequality

Similar to the initial results from our EDA, the Midwest region (which is included in the intercept) has the lowest estimate for mental_health_days and the South has the highest. All the coefficients are statistically significant. Our measure for income inequality is statistically significant at the 0.001 level with a coefficient of 0.247, indicating a positive association with poor mental health days.

After including the year factor variables, the economic variables, and the demographic variables, the coefficients on all variables remained statistically significant. However, the coefficient on inequality declined to 0.072 in the third model (which included the most variables). This suggests that the base model exhibited omitted variable bias that the fuller model at least partially addressed. The South still had the largest coefficient among the region factor variables.

In general, The other economic variables demonstrated relationships with mental health that were expected. For instance, higher rates forunempl, child_poverty, single_parent, and severe_housing were all associated with worse mental health. A better food environment, higher median household incomes, and having a larger percentage of the population with some post-secondary education were associated with better mental health. Interestingly, a higher high-school graduation rate was positively associated with poor mental health.

The following table shows the regression results for the models discussed above (#1, #2, and #3) and the model produced by feature selection (#4).

tables_a_c <- c("(Intercept)", "inequality", 
                       "Northeast" = "regionNortheast", "South" = "regionSouth", "West" = "regionWest", 
                       "college", "hs_grad", "unempl", "child_poverty", "single_parent", 
                       "severe_housing", "food_index", "median_inc", 
                       "pop_provider_ratio", "pop")

export_summs(lm_days_1, lm_days_2, lm_days_3, lm_days_fs, 
             #error_format = "({p.value})",
             statistics = c(Num.obs = "nobs", R2 = "r.squared", adj.R2 = "adj.r.squared"), 
             model.names = c("1", "2", "3", "4"), 
             coefs = tables_a_c
             )
1234
(Intercept)2.68 ***2.32 ***5.86 ***2.12 ***
(0.03)   (0.03)   (0.32)   (0.10)   
inequality0.25 ***0.25 ***0.07 ***0.15 ***
(0.01)   (0.01)   (0.01)   (0.01)   
Northeast0.21 ***0.21 ***0.25 ***       
(0.02)   (0.02)   (0.01)          
South0.47 ***0.47 ***0.32 ***       
(0.01)   (0.01)   (0.01)          
West0.12 ***0.12 ***0.19 ***       
(0.01)   (0.01)   (0.01)          
college              -1.64 ***-2.47 ***
              (0.04)   (0.05)   
hs_grad              0.15 ***       
              (0.04)          
unempl              5.19 ***-4.09 ***
              (0.21)   (0.26)   
child_poverty              1.34 ***2.73 ***
              (0.08)   (0.09)   
single_parent              0.39 ***-1.57 ***
              (0.05)   (0.06)   
severe_housing              0.39 ***       
              (0.09)          
food_index              -0.04 ***       
              (0.00)          
median_inc              -0.00 ***       
              (0.00)          
pop_provider_ratio              -0.00 ***-0.00 ***
              (0.00)   (0.00)   
pop              0.00 ***       
              (0.00)          
Num.obs18465       18465       15728       15728       
R20.23    0.45    0.74    0.46    
adj.R20.23    0.45    0.74    0.46    
*** p < 0.001; ** p < 0.01; * p < 0.05.

The model produced by feature selection has a relatively low adjusted R2, compared to the models with the full set of control variables. As a result, we decided to focus on the models that explain a larger degree of variance.

After adding so many different variables, we want to check for multicollinearity. Thus, we also evaluated the VIFs for each model. The majority of VIF results are below 5, which is acceptable. However, several variables have extremely high VIFs – specifically, the demographic variables for ethnic groups. The following model removed pct_white, which improved the VIFs significantly. It also removed child_poverty, which had a relatively high VIF of NA. After removing the collinear variables, the VIFs of the model were all below 5.

# v5: remove vars to reduce vif
lm_days_5 <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic
                + region + year, data = ranked)
summ(lm_days_5, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(25,15702) 1623.59
0.72
Adj. R² 0.72
Est. S.E. t val. p VIF
(Intercept) 2.39 0.09 25.39 0.00 NA
inequality 0.12 0.01 22.51 0.00 1.81
college -2.19 0.04 -55.69 0.00 2.52
hs_grad -0.04 0.04 -0.80 0.42 1.50
unempl 6.37 0.21 30.12 0.00 2.08
single_parent 0.92 0.05 17.60 0.00 3.51
severe_housing 0.47 0.09 5.09 0.00 2.16
food_index -0.10 0.00 -25.17 0.00 2.45
pop 0.00 0.00 9.14 0.00 1.49
pop_provider_ratio -0.00 0.00 -7.84 0.00 1.27
pct_female 4.17 0.17 25.11 0.00 1.43
pct_rural -0.13 0.01 -9.32 0.00 2.23
pct_below18 0.48 0.11 4.25 0.00 1.71
pct_black -1.35 0.04 -38.57 0.00 3.22
pct_native_am -0.65 0.06 -11.14 0.00 1.79
pct_asian -1.13 0.14 -7.81 0.00 2.23
pct_pacific -1.84 0.75 -2.47 0.01 1.45
pct_hispanic -1.43 0.03 -48.60 0.00 1.91
regionNortheast 0.18 0.01 15.24 0.00 3.02
regionSouth 0.32 0.01 39.22 0.00 3.02
regionWest 0.14 0.01 12.27 0.00 3.02
year2017 0.16 0.01 15.98 0.00 1.89
year2018 0.39 0.01 37.62 0.00 1.89
year2019 0.44 0.01 41.84 0.00 1.89
year2020 0.72 0.01 65.10 0.00 1.89
year2021 1.33 0.01 112.53 0.00 1.89
Standard errors: OLS

To permit the coefficients for inequality to differ by region, we included an interaction term between inequality and region.

# v6: base model by region with interaction terms
lm_days_6 <- lm(mental_health_days ~ inequality*region, data = ranked)
summ(lm_days_6)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(7,18457) 787.92
0.23
Adj. R² 0.23
Est. S.E. t val. p
(Intercept) 2.36 0.06 40.88 0.00
inequality 0.32 0.01 23.45 0.00
regionNortheast 1.61 0.13 12.29 0.00
regionSouth 0.80 0.07 11.17 0.00
regionWest 0.41 0.10 4.20 0.00
inequality:regionNortheast -0.31 0.03 -10.80 0.00
inequality:regionSouth -0.08 0.02 -4.83 0.00
inequality:regionWest -0.07 0.02 -3.10 0.00
Standard errors: OLS

The equations for each region with interaction terms are as follows:

  • regionMidwest: mental_health_days = 2.364 + 0.321 * inequality
  • regionNortheast: mental_health_days = 2.364 + 1.613 + (0.321 + -0.314) * inequality
  • regionSouth: mental_health_days = 2.364 + 0.803 + (0.321 + -0.079) * inequality
  • regionWest: mental_health_days = 2.364 + 0.41 + (0.321 + -0.069) * inequality

These results suggest that the strength of the relationship between mental_health_days and inequality differs by region. In fact, the coefficient on inequality is near zero for the Northeast. Conversely, the Midwest exhibits the strongest association. All the independent variables are statistically significant.

# v7: larger model with interaction terms
lm_days_7 <- lm(mental_health_days ~ inequality*region + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic
                + year, data = ranked) #, terms = 'marginal')
summ(lm_days_7)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(28,15699) 1460.05
0.72
Adj. R² 0.72
Est. S.E. t val. p
(Intercept) 2.13 0.10 20.84 0.00
inequality 0.18 0.01 17.78 0.00
regionNortheast 0.45 0.08 5.52 0.00
regionSouth 0.73 0.05 14.72 0.00
regionWest 0.19 0.06 2.86 0.00
college -2.19 0.04 -55.86 0.00
hs_grad -0.03 0.04 -0.68 0.50
unempl 6.39 0.21 30.26 0.00
single_parent 0.92 0.05 17.50 0.00
severe_housing 0.44 0.09 4.74 0.00
food_index -0.10 0.00 -24.87 0.00
pop 0.00 0.00 8.84 0.00
pop_provider_ratio -0.00 0.00 -8.03 0.00
pct_female 4.06 0.17 24.40 0.00
pct_rural -0.12 0.01 -8.54 0.00
pct_below18 0.63 0.11 5.50 0.00
pct_black -1.31 0.04 -36.96 0.00
pct_native_am -0.76 0.06 -12.65 0.00
pct_asian -1.25 0.15 -8.59 0.00
pct_pacific -1.53 0.75 -2.05 0.04
pct_hispanic -1.44 0.03 -48.51 0.00
year2017 0.16 0.01 16.04 0.00
year2018 0.39 0.01 37.72 0.00
year2019 0.44 0.01 41.95 0.00
year2020 0.72 0.01 65.34 0.00
year2021 1.33 0.01 112.82 0.00
inequality:regionNortheast -0.06 0.02 -3.40 0.00
inequality:regionSouth -0.09 0.01 -8.26 0.00
inequality:regionWest -0.01 0.01 -0.81 0.42
Standard errors: OLS
# check vifs but use "high-order" because of interaction terms
# this method computes a GVIF for each high-order term in the model
# GVIF^[1/(2*df)], the last of which is intended to be comparable across terms of different dimension
vif(lm_days_7, type = "high-order")
                   GVIF Df GVIF^(1/(2*Df))

inequality 6.91e+00 1 2.63 region 1.43e+05 3 7.23 college 2.52e+00 1 1.59 hs_grad 1.50e+00 1 1.23 unempl 2.09e+00 1 1.44 single_parent 3.52e+00 1 1.88 severe_housing 2.17e+00 1 1.47 food_index 2.47e+00 1 1.57 pop 1.50e+00 1 1.22 pop_provider_ratio 1.28e+00 1 1.13 pct_female 1.44e+00 1 1.20 pct_rural 2.25e+00 1 1.50 pct_below18 1.77e+00 1 1.33 pct_black 3.29e+00 1 1.81 pct_native_am 1.91e+00 1 1.38 pct_asian 2.27e+00 1 1.51 pct_pacific 1.45e+00 1 1.21 pct_hispanic 1.95e+00 1 1.40 year 1.89e+00 5 1.07 inequality:region 1.92e+05 3 7.60

The equations for the fuller models with interaction terms are below, where Vc is a vector of additional control variables:

  • regionMidwest: mental_health_days = 2.13 + 0.182 * inequality + Vc
  • regionNortheast: mental_health_days = 2.13 + 0.448 + (0.182 - 0.0616) * inequality + Vc
  • regionSouth: mental_health_days = 2.13 + 0.734 + (0.182 - 0.0944) * inequality + Vc
  • regionWest: mental_health_days = 2.13 + 0.186 + (0.182 - 0.0121) * inequality + Vc

After including the set of control variables, the coefficients for inequality still differ by region. However, now it appears that the South has the weakest association between inequality and mental_health_days and the Midwest still has the strongest.

The following table shows the regression results for models #1, #5, #6, and #7 (from above).

tables_b_d <- c("(Intercept)", "inequality", 
                       "Northeast" = "regionNortheast", "South" = "regionSouth", "West" = "regionWest", 
                       "inequality:NE" = "inequality:regionNortheast", 
                       "inequality:SO" = "inequality:regionSouth", 
                       "inequality:WE" = "inequality:regionWest",
                       "college", "hs_grad", "unempl", "single_parent", 
                        "severe_housing", "food_index")

export_summs(lm_days_1, lm_days_5, lm_days_6, lm_days_7, 
             statistics = c(Num.obs = "nobs", R2 = "r.squared", adj.R2 = "adj.r.squared"), 
             model.names = c("1", "5", "6", "7"), 
             coefs = tables_b_d
             )
1567
(Intercept)2.68 ***2.39 ***2.36 ***2.13 ***
(0.03)   (0.09)   (0.06)   (0.10)   
inequality0.25 ***0.12 ***0.32 ***0.18 ***
(0.01)   (0.01)   (0.01)   (0.01)   
Northeast0.21 ***0.18 ***1.61 ***0.45 ***
(0.02)   (0.01)   (0.13)   (0.08)   
South0.47 ***0.32 ***0.80 ***0.73 ***
(0.01)   (0.01)   (0.07)   (0.05)   
West0.12 ***0.14 ***0.41 ***0.19 ** 
(0.01)   (0.01)   (0.10)   (0.06)   
inequality:NE              -0.31 ***-0.06 ***
              (0.03)   (0.02)   
inequality:SO              -0.08 ***-0.09 ***
              (0.02)   (0.01)   
inequality:WE              -0.07 ** -0.01    
              (0.02)   (0.01)   
college       -2.19 ***       -2.19 ***
       (0.04)          (0.04)   
hs_grad       -0.04           -0.03    
       (0.04)          (0.04)   
unempl       6.37 ***       6.39 ***
       (0.21)          (0.21)   
single_parent       0.92 ***       0.92 ***
       (0.05)          (0.05)   
severe_housing       0.47 ***       0.44 ***
       (0.09)          (0.09)   
food_index       -0.10 ***       -0.10 ***
       (0.00)          (0.00)   
Num.obs18465       15728       18465       15728       
R20.23    0.72    0.23    0.72    
adj.R20.23    0.72    0.23    0.72    
*** p < 0.001; ** p < 0.01; * p < 0.05.

After removing the collinear variables, the larger models still perform well, with an adjusted R2 of 0.72. Including the interaction terms between inequality and region do not change the adjusted R2, but they suggest that there are meaningful differences across regions for the relationship between mental health and inequality.

Dependent variable: mental_distress_rate

Next, we considered a series of linear models with mental_distress_rate as the dependent variable. In the following models, we used different combinations of variables to find the most suitable model. Just as above, we also incorporated the model produced by feature selection.

# x1: base model by region
lm_rate_1 <- lm(mental_distress_rate ~ inequality + region, data = ranked)
summ(lm_rate_1, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(4,18460) 1465.34
0.24
Adj. R² 0.24
Est. S.E. t val. p VIF
(Intercept) 0.07 0.00 75.33 0.00 NA
inequality 0.01 0.00 45.00 0.00 1.17
regionNortheast 0.00 0.00 1.66 0.10 1.17
regionSouth 0.01 0.00 37.84 0.00 1.17
regionWest 0.00 0.00 5.32 0.00 1.17
Standard errors: OLS
# x2: include year as factor variable
lm_rate_2 <- lm(mental_distress_rate ~ inequality + region + year, data = ranked)
summ(lm_rate_2, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(9,18455) 2255.52
0.52
Adj. R² 0.52
Est. S.E. t val. p VIF
(Intercept) 0.06 0.00 73.30 0.00 NA
inequality 0.01 0.00 57.08 0.00 1.17
regionNortheast 0.00 0.00 2.07 0.04 1.17
regionSouth 0.01 0.00 47.68 0.00 1.17
regionWest 0.00 0.00 6.56 0.00 1.17
year2017 0.00 0.00 10.07 0.00 1.00
year2018 0.01 0.00 23.01 0.00 1.00
year2019 0.01 0.00 23.12 0.00 1.00
year2020 0.02 0.00 42.03 0.00 1.00
year2021 0.04 0.00 92.94 0.00 1.00
Standard errors: OLS
# x3: include economic and demographic vars
lm_rate_3 <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + pop_provider_ratio
                + pop + pct_below18 + pct_female + pct_rural
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic + pct_white
                + region + year, data = ranked)
summ(lm_rate_3, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(28,15699) 2565.65
0.82
Adj. R² 0.82
Est. S.E. t val. p VIF
(Intercept) 0.16 0.01 17.41 0.00 NA
inequality 0.00 0.00 17.91 0.00 2.01
college -0.05 0.00 -38.82 0.00 3.12
hs_grad 0.01 0.00 4.28 0.00 1.52
unempl 0.14 0.01 22.69 0.00 2.25
child_poverty 0.05 0.00 22.28 0.00 7.54
single_parent 0.01 0.00 6.82 0.00 3.90
severe_housing 0.02 0.00 8.86 0.00 2.17
food_index -0.00 0.00 -12.71 0.00 3.07
median_inc -0.00 0.00 -22.19 0.00 5.51
pop_provider_ratio -0.00 0.00 -6.50 0.00 1.28
pop 0.00 0.00 5.82 0.00 1.49
pct_below18 0.05 0.00 15.30 0.00 1.76
pct_female 0.10 0.00 20.74 0.00 1.47
pct_rural -0.00 0.00 -9.57 0.00 2.40
pct_black -0.12 0.01 -13.65 0.00 273.52
pct_native_am -0.09 0.01 -9.74 0.00 61.64
pct_asian -0.10 0.01 -9.64 0.00 14.94
pct_pacific -0.32 0.03 -10.83 0.00 3.04
pct_hispanic -0.13 0.01 -14.99 0.00 216.22
pct_white -0.09 0.01 -10.41 0.00 516.34
regionNortheast 0.00 0.00 14.24 0.00 3.44
regionSouth 0.01 0.00 36.50 0.00 3.44
regionWest 0.00 0.00 12.38 0.00 3.44
year2017 0.01 0.00 21.24 0.00 1.98
year2018 0.01 0.00 46.55 0.00 1.98
year2019 0.02 0.00 51.97 0.00 1.98
year2020 0.02 0.00 81.19 0.00 1.98
year2021 0.05 0.00 148.82 0.00 1.98
Standard errors: OLS
# x4: feature selection
summ(lm_rate_fs, vifs = T)
Observations 15728
Dependent variable mental_distress_rate
Type OLS linear regression
F(12,15715) 1296.98
0.50
Adj. R² 0.50
Est. S.E. t val. p VIF
(Intercept) 0.05 0.00 15.75 0.00 NA
inequality 0.01 0.00 24.10 0.00 1.81
college -0.08 0.00 -41.58 0.00 2.50
mh_providers 1.74 0.10 18.27 0.00 1.33
pct_female 0.18 0.01 24.99 0.00 1.25
child_poverty 0.12 0.00 37.54 0.00 4.47
unempl -0.21 0.01 -23.80 0.00 1.70
single_parent -0.06 0.00 -30.43 0.00 2.12
region_abbNE 0.00 0.00 3.77 0.00 2.06
region_abbS 0.01 0.00 16.10 0.00 2.06
region_abbW 0.00 0.00 4.04 0.00 2.06
pct_hispanic -0.03 0.00 -27.70 0.00 1.16
pct_native_am 0.04 0.00 19.28 0.00 1.15
Standard errors: OLS

From the first model, we can see that the equations for each region are below:

  • regionMidwest: mental_distress_rate = 0.073 + 0.010 * inequality
  • regionNortheast: mental_distress_rate = 0.073 + 0.001 + 0.010 * inequality
  • regionSouth: mental_distress_rate = 0.073 + 0.014 + 0.010 * inequality
  • regionWest: mental_distress_rate = 0.073 + 0.003 + 0.010 * inequality

From above, we can see that the Midwest has the lowest estimate for mental_distress_rate, and the South has the highest estimate. The coefficient for the Northeast region is not significant at the 0.05 level, while the others are significant.

After including the year factor variables, the economic variables, and the demographic variables, the coefficients on most variables remained significant. However, the coefficient on inequality declined from 0.0101 to 0.00274 in the third model, which suggests that the base model exhibited omitted variable bias that the fuller model at least partially addressed. The South region still has the largest coefficient among the region factor variables.

export_summs(lm_rate_1, lm_rate_2, lm_rate_3, lm_rate_fs,
             statistics = c(Num.obs = "nobs", R2 = "r.squared", adj.R2 = "adj.r.squared"), 
             model.names = c("1", "2", "3", "4"), 
             coefs = tables_a_c
             )
1234
(Intercept)0.07 ***0.06 ***0.16 ***0.05 ***
(0.00)   (0.00)   (0.01)   (0.00)   
inequality0.01 ***0.01 ***0.00 ***0.01 ***
(0.00)   (0.00)   (0.00)   (0.00)   
Northeast0.00    0.00 *  0.00 ***       
(0.00)   (0.00)   (0.00)          
South0.01 ***0.01 ***0.01 ***       
(0.00)   (0.00)   (0.00)          
West0.00 ***0.00 ***0.00 ***       
(0.00)   (0.00)   (0.00)          
college              -0.05 ***-0.08 ***
              (0.00)   (0.00)   
hs_grad              0.01 ***       
              (0.00)          
unempl              0.14 ***-0.21 ***
              (0.01)   (0.01)   
child_poverty              0.05 ***0.12 ***
              (0.00)   (0.00)   
single_parent              0.01 ***-0.06 ***
              (0.00)   (0.00)   
severe_housing              0.02 ***       
              (0.00)          
food_index              -0.00 ***       
              (0.00)          
median_inc              -0.00 ***       
              (0.00)          
pop_provider_ratio              -0.00 ***       
              (0.00)          
pop              0.00 ***       
              (0.00)          
Num.obs18465       18465       15728       15728       
R20.24    0.52    0.82    0.50    
adj.R20.24    0.52    0.82    0.50    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Similar to the previous section, the model produced by feature selection has a much lower adjusted R2 than the model with the full set of control variables. In fact, the model with just region and year factor variables explains more variation in the dependent variable than the feature selection model.

After adding so many different variables, we need to check for multicollinearity. Thus, we also evaluated the VIFs for each of the above models. The majority of VIF results are below 5, which is acceptable. However, several variables have high VIFs – such as child_poverty, pct_black, pct_white and pct_hispanic. We remove these in the following model.

# x5: remove vars to reduce vif
lm_rate_5 <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_native_am + pct_asian + pct_pacific 
                + region + year, data = ranked)
#summary(lm_rate_5)
summ(lm_rate_5, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(23,15704) 2225.58
0.77
Adj. R² 0.76
Est. S.E. t val. p VIF
(Intercept) 0.06 0.00 19.14 0.00 NA
inequality 0.00 0.00 25.22 0.00 1.80
college -0.07 0.00 -57.91 0.00 2.33
hs_grad 0.00 0.00 3.28 0.00 1.44
unempl 0.19 0.01 27.94 0.00 2.08
single_parent 0.00 0.00 2.82 0.00 2.56
severe_housing -0.01 0.00 -2.70 0.01 2.02
food_index -0.00 0.00 -30.52 0.00 2.30
pop 0.00 0.00 2.22 0.03 1.48
pop_provider_ratio -0.00 0.00 -13.75 0.00 1.25
pct_female 0.19 0.01 36.87 0.00 1.34
pct_rural 0.00 0.00 8.02 0.00 2.06
pct_below18 -0.04 0.00 -11.17 0.00 1.41
pct_native_am 0.03 0.00 15.59 0.00 1.45
pct_asian -0.06 0.00 -12.16 0.00 2.21
pct_pacific 0.08 0.02 3.22 0.00 1.43
regionNortheast 0.00 0.00 3.98 0.00 2.40
regionSouth 0.00 0.00 19.41 0.00 2.40
regionWest -0.00 0.00 -2.39 0.02 2.40
year2017 0.01 0.00 17.64 0.00 1.68
year2018 0.01 0.00 39.88 0.00 1.68
year2019 0.01 0.00 42.55 0.00 1.68
year2020 0.02 0.00 66.71 0.00 1.68
year2021 0.05 0.00 127.31 0.00 1.68
Standard errors: OLS

To permit the coefficients to differ by region, we included an interaction term between inequality and region

# x6: base model by region with interaction terms
lm_rate_6 <- lm(mental_distress_rate ~ inequality*region, data = ranked)
summ(lm_rate_6)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(7,18457) 861.02
0.25
Adj. R² 0.25
Est. S.E. t val. p
(Intercept) 0.06 0.00 32.79 0.00
inequality 0.01 0.00 26.36 0.00
regionNortheast 0.05 0.00 11.33 0.00
regionSouth 0.02 0.00 9.03 0.00
regionWest 0.01 0.00 3.99 0.00
inequality:regionNortheast -0.01 0.00 -11.18 0.00
inequality:regionSouth -0.00 0.00 -3.49 0.00
inequality:regionWest -0.00 0.00 -3.32 0.00
Standard errors: OLS

From the model x6, we can see that the equations for each region are below:

  • regionMidwest: mental_distress_rate = 0.064421 + 0.012271 * inequality
  • regionNortheast: mental_distress_rate = 0.064421 + 0.050482 + (0.012271 - 0.011059) * inequality
  • regionSouth: mental_distress_rate = 0.064421 + 0.022052 + (0.012271 - 0.001928) * inequality
  • regionWest: mental_distress_rate = 0.064421 + 0.013255 + (0.012271 - 0.002526) * inequality

From the above equations, we can see that the strength of the relationship between mental_distress_rate and inequality differs by region. In fact, while the Northeast has the highest estimated amount of mental distress rate (because the intercept of Northeast region is 0.115), it also has the smallest coefficient on inequality at 0.001. Conversely, the Midwest region has the lowest estimate of mental distress rate (0.064421), while exhibiting the strongest association. All the independent variables are statistically significant.

# x7: larger model with interaction terms
lm_rate_7 <- lm(mental_distress_rate ~ inequality*region + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_native_am + pct_asian + pct_pacific
                + year, data = ranked)
#summary(lm_rate_7)
summ(lm_rate_7)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(26,15701) 1984.99
0.77
Adj. R² 0.77
Est. S.E. t val. p
(Intercept) 0.04 0.00 13.86 0.00
inequality 0.01 0.00 21.59 0.00
regionNortheast 0.02 0.00 6.41 0.00
regionSouth 0.02 0.00 12.58 0.00
regionWest 0.01 0.00 5.84 0.00
college -0.07 0.00 -57.88 0.00
hs_grad 0.00 0.00 3.27 0.00
unempl 0.19 0.01 28.17 0.00
single_parent 0.00 0.00 3.15 0.00
severe_housing -0.01 0.00 -2.68 0.01
food_index -0.00 0.00 -30.34 0.00
pop 0.00 0.00 2.13 0.03
pop_provider_ratio -0.00 0.00 -13.64 0.00
pct_female 0.19 0.01 36.54 0.00
pct_rural 0.00 0.00 8.90 0.00
pct_below18 -0.03 0.00 -9.96 0.00
pct_native_am 0.02 0.00 13.49 0.00
pct_asian -0.06 0.00 -12.26 0.00
pct_pacific 0.08 0.02 3.34 0.00
year2017 0.01 0.00 17.75 0.00
year2018 0.01 0.00 40.07 0.00
year2019 0.01 0.00 42.79 0.00
year2020 0.02 0.00 67.07 0.00
year2021 0.05 0.00 127.88 0.00
inequality:regionNortheast -0.00 0.00 -6.08 0.00
inequality:regionSouth -0.00 0.00 -9.76 0.00
inequality:regionWest -0.00 0.00 -6.42 0.00
Standard errors: OLS
# check vifs but use "high-order" because of interaction terms
# this method computes a GVIF for each high-order term in the model
# GVIF^[1/(2*df)], the last of which is intended to be comparable across terms of different dimension
vif(lm_rate_7, type = "high-order")
                   GVIF Df GVIF^(1/(2*Df))

inequality 6.85e+00 1 2.62 region 1.39e+05 3 7.20 college 2.34e+00 1 1.53 hs_grad 1.45e+00 1 1.20 unempl 2.08e+00 1 1.44 single_parent 2.57e+00 1 1.60 severe_housing 2.03e+00 1 1.43 food_index 2.32e+00 1 1.52 pop 1.49e+00 1 1.22 pop_provider_ratio 1.25e+00 1 1.12 pct_female 1.35e+00 1 1.16 pct_rural 2.08e+00 1 1.44 pct_below18 1.46e+00 1 1.21 pct_native_am 1.60e+00 1 1.26 pct_asian 2.24e+00 1 1.50 pct_pacific 1.44e+00 1 1.20 year 1.69e+00 5 1.05 inequality:region 1.83e+05 3 7.53

The equations for the fuller models with interaction terms are below, where Vc is a vector of additional control variables:

  • regionMidwest: mental_distress_rate = 0.0445 + 0.00696 * inequality + Vc
  • regionNortheast: mental_distress_rate = 0.0445 + 0.0165 + (0.00696 - 0.00348) * inequality + Vc
  • regionSouth: mental_distress_rate = 0.0445 + 0.0198 + (0.00696 - 0.00350) * inequality + Vc
  • regionWest: mental_distress_rate = 0.0445 + 0.0119 + (0.00696 - 0.00300) * inequality + Vc

After including the set of control variables, the coefficients for inequality and the interaction terms are so small that the differences by region are barely noticeable. Looking at these smaller coefficients, the Midwest has the strongest association, while South has the weakest association. However, as a whole, while there is still a statistically significant relationship between mental_distress_rate and inequality, the strength of the relationship seems very minor.

export_summs(lm_rate_1, lm_rate_5, lm_rate_6, lm_rate_7,
             statistics = c(Num.obs = "nobs", R2 = "r.squared", adj.R2 = "adj.r.squared"), 
             model.names = c("1", "5", "6", "7"), 
             coefs = tables_b_d
             )
1567
(Intercept)0.07 ***0.06 ***0.06 ***0.04 ***
(0.00)   (0.00)   (0.00)   (0.00)   
inequality0.01 ***0.00 ***0.01 ***0.01 ***
(0.00)   (0.00)   (0.00)   (0.00)   
Northeast0.00    0.00 ***0.05 ***0.02 ***
(0.00)   (0.00)   (0.00)   (0.00)   
South0.01 ***0.00 ***0.02 ***0.02 ***
(0.00)   (0.00)   (0.00)   (0.00)   
West0.00 ***-0.00 *  0.01 ***0.01 ***
(0.00)   (0.00)   (0.00)   (0.00)   
inequality:NE              -0.01 ***-0.00 ***
              (0.00)   (0.00)   
inequality:SO              -0.00 ***-0.00 ***
              (0.00)   (0.00)   
inequality:WE              -0.00 ***-0.00 ***
              (0.00)   (0.00)   
college       -0.07 ***       -0.07 ***
       (0.00)          (0.00)   
hs_grad       0.00 **        0.00 ** 
       (0.00)          (0.00)   
unempl       0.19 ***       0.19 ***
       (0.01)          (0.01)   
single_parent       0.00 **        0.00 ** 
       (0.00)          (0.00)   
severe_housing       -0.01 **        -0.01 ** 
       (0.00)          (0.00)   
food_index       -0.00 ***       -0.00 ***
       (0.00)          (0.00)   
Num.obs18465       15728       18465       15728       
R20.24    0.77    0.25    0.77    
adj.R20.24    0.76    0.25    0.77    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Without the collinear variables, the larger models perform well again, with an adjusted R2 above 0.75. Including the interaction terms between inequality and region do not change the adjusted R2 very much, but they suggest that there are meaningful differences across regions for the relationship between mental health and inequality.

Model Evaluation

ANOVA Comparison of Models

To evaluate our linear models, we conducted ANOVA tests on their residuals. In the following tables, we compare the ANOVA results for the model with the full set of control variables (#3), the model produced from feature selection (#4), the model removing several variables to reduce multicollinearity (#5), and the model including interaction terms and control variables (#7). We do this for each dependent variable.

The results for mental_health_days suggest that the final model (#7) is the best model. This model includes interaction terms between inequality and region as well as a set of economic and demographic variables. The F-statistics for these comparisons are all statistically significant.

anovaRes_mental_health_days <- anova(lm_days_3, lm_days_fs, lm_days_5, lm_days_7)

xkabledply(anovaRes_mental_health_days, title = "ANOVA comparison of linear models for mental health days")
ANOVA comparison of linear models for mental health days
Res.Df RSS Df Sum of Sq F Pr(>F)
15699 1820 NA NA NA NA
15715 3777 -16 -1957.1 1055.0 0
15702 1956 13 1821.8 1208.7 0
15699 1945 3 10.3 29.8 0

Similar to the above results, the best model is the final model (#7) when mental_distress_rate is the dependent variable. This model includes interaction terms between inequality and region as well as a set of economic and demographic variables. The F-statistics for these comparisons are all statistically significant.

anovaRes_mental_distress_rate <- anova(lm_rate_3, lm_rate_fs, lm_rate_5, lm_rate_7)

xkabledply(anovaRes_mental_distress_rate, title = "ANOVA comparison of linear models for mental distress rate")
ANOVA comparison of linear models for mental distress rate
Res.Df RSS Df Sum of Sq F Pr(>F)
15699 1.50 NA NA NA NA
15715 4.20 -16 -2.6984 1767.6 0
15704 1.96 11 2.2355 2130.0 0
15701 1.95 3 0.0126 43.9 0

Comparing the Dependent Variables

The results from the linear models for both dependent variables – mental_health_days and mental_distress_rate – are similar. Every model has a statistically significant coefficient on inequality, and the coefficients for the majority of regions are statistically significant too. These general trends remained after adding more independent variables.

We included an interaction term between inequality and region to permit the coefficients to differ by region. For both dependent variables, the Midwest region exhibits the strongest association between inequality and mental health, and the South has the weakest association.

However, when mental_distress_rate is the dependent variable, the strength of the association is generally weaker than for the models with mental_health_days as the dependent variable. This could be partly due to how the variables are measured. For example, mental_health_days may range from 0 to 30 (and is usually > 4), while mental_distress_rate is always a percentage ranging from 0 to 1 (and typically < 0.15). Additionally, these differences could also suggest that the relationship between mental health and inequality also depends on the measure of mental health used.

Regression Tree Models

We also experimented with regression tree models to determine if other model types could improve predicting the prevalence of the mental health variables and variable importance for predicting each mental health variable.

The regression tree model analyzes the relationship between number of poor mental health days and the frequent mental distress rate versus all the economic variables included in our dataset. Additionally, region and year were included as explanatory variables to observe the regional and temporal effects on the dependent variable. In other words, the model regresses mental_health_days against inequality, median_inc, hs_grad, college, unempl, child_poverty, single_parent, severe_housing, food_index, mh_providers, pop_provider_ratio, region, and year.

The complexity parameter (cp) and max depth values for each regression tree are listed below. The cp values were increased for the pruned trees to decrease the tree size to improve prediction accuracy.

Reg. Tree mental_health_days: Max depth and cp values
Original Tree: max depth = 8; cp = 0.004
Pruned Tree: max depth = 8; cp = 0.0067

Reg. Tree mental_health_days: Max depth and cp values
Original Tree: max depth = 8; cp = 0.004
Pruned Tree: max depth = 8; cp = 0.01

Overall, the original regression tree models had lower RMSE values than the pruned trees. This indicates that the original trees were the better predicting models.

Regression Tree - mental_health_days

#Regression Tree - Number of Mentally Unhealthy Days

#Remove NA values.
ranked <- na.omit(ranked)

#Creating Train and Test Set - ranked
set.seed(123)
train = sample(1:nrow(ranked), nrow(ranked)/1.25)

#First Model for Number of Mentally Unhealthy Days (ranked; all vars.)
model_ment_helth_day_tree <- rpart(mental_health_days ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, ranked, subset=train, control = list(maxdepth=8, cp=0.004))

#Summary for Number of Mentally Unhealthy Days
summ_table1=summary(model_ment_helth_day_tree)
table1_FI_tree<-data.frame(summ_table1$variable.importance)
table1_FI_tree %>%
  kbl(caption="Variable Importance: Target Y - mental_health_days (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Variable Importance: Target Y - mental_health_days (all vars.)
Feature Importance
year 1586.83
child_poverty 1502.05
median_inc 1173.68
food_index 615.87
unempl 588.94
college 583.42
single_parent 545.39
inequality 54.16
region 52.43
mh_providers 14.08
hs_grad 13.31
pop_provider_ratio 12.39
severe_housing 7.11
#Plot Regression Tree
#rpart.plot(model_ment_helth_day_tree)
fancyRpartPlot(model_ment_helth_day_tree, main = "Regression Tree: Target Y - mental_health_days (all vars.)")

#Print Complexity Penalty
#printcp(model_ment_helth_day_tree)

#Plotting Complexity Parameters 
par(mar = c(5, 5, 10, 5)) 
plotcp(model_ment_helth_day_tree, minline = TRUE, lty = 3, col = 1, main="Regression Tree: CP Plot")

#R-squared
#rsq.rpart(model_ment_helth_day_tree)

Pruned Regression Tree - mental_health_days

#Prune Tree - Number of Mentally Unhealthy Days

#Prune Tree Model
prune_model_ment_helth_day_tree=prune.rpart(model_ment_helth_day_tree, cp=0.0067, best=5)


#Summary of Pruned Tree
summ_table2=summary(prune_model_ment_helth_day_tree)
#summ_table2$variable.importance

table2_FI_tree<-data.frame(summ_table2$variable.importance)
table2_FI_tree %>%
  kbl(caption="Pruned Variable Importance: Target Y - mental_health_days (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Pruned Variable Importance: Target Y - mental_health_days (all vars.)
Feature Importance
year 1579.616
child_poverty 1479.220
median_inc 1143.893
food_index 611.472
college 544.814
single_parent 536.570
unempl 516.602
inequality 53.924
region 52.431
hs_grad 11.096
mh_providers 3.515
pop_provider_ratio 1.827
severe_housing 0.868
#Plot Pruned Tree
fancyRpartPlot(prune_model_ment_helth_day_tree, main = "Pruned Regression Tree: Target Y - mental_health_days (all vars.)")

#Print Complexity Penalty of Prune Tree
#printcp(prune_model_ment_helth_day_tree)

#Plot Complexity Penalty of Prune Tree 
par(mar = c(5, 5, 10, 5)) 
plotcp(prune_model_ment_helth_day_tree,minline = TRUE, lty = 3, col = 1, main="Pruned Regression Tree: CP Plot")

#R-squared
#rsq.rpart(prune_model_ment_helth_day_tree)

Predicting/RMSE for Regression Trees - mental_health_days

#Predicting and RMSE for Regression Trees

#Prediction/Testing - Original Model
yhat1=predict(model_ment_helth_day_tree,newdata=ranked[-train,])
test1=ranked[-train,"mental_health_days"]

plot(yhat1,test1, main="Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

#Prediction/Testing - Pruned Model
yhat_prune1=predict(prune_model_ment_helth_day_tree,newdata=ranked[-train,])


plot(yhat_prune1,test1, main="Pruned Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

Original Reg. Tree RMSE:

#MSE
#mean((yhat1-test)^2)

#RMSE
sqrt(mean((yhat1-test1)^2))
## [1] 0.427

Pruned Tree RMSE:

#MSE
#mean((yhat_prune1-test)^2)

#RMSE
sqrt(mean((yhat_prune1-test1)^2))
## [1] 0.435

Regression Tree - mental_distress_rate

#Regression Trees - Mental Health Distress Rate

#Creating Train and Test Set - ranked
set.seed(123)
train = sample(1:nrow(ranked), nrow(ranked)/1.25)

#First Model for Mental Health Distress Rate (ranked; all vars.)
model_ment_dis_rate_tree <- rpart(mental_distress_rate ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, ranked, subset=train, control = list(maxdepth=8, cp=0.004))

#Summary for Number of Mentally Unhealthy Days
summ_table3=summary(model_ment_dis_rate_tree)
#summ_table3$variable.importance

table3_FI_tree<-data.frame(summ_table3$variable.importance)
table3_FI_tree %>%
  kbl(caption="Variable Importance: Target Y - mental_distress_rate (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Variable Importance: Target Y - mental_distress_rate (all vars.)
Feature Importance
child_poverty 2.210
year 2.038
median_inc 1.987
food_index 1.117
college 1.111
single_parent 0.995
unempl 0.803
inequality 0.243
mh_providers 0.036
pop_provider_ratio 0.035
hs_grad 0.015
region 0.012
severe_housing 0.002
#Plot Regression Tree
#rpart.plot(model_ment_dis_rate_tree)
fancyRpartPlot(model_ment_dis_rate_tree, main = "Regression Tree: Target Y - mental_distress_rate (all vars.)")

#Print Complexity Penalty
#printcp(model_ment_dis_rate_tree)


#Plotting Complexity Parameters
par(mar = c(5, 5, 10, 5))
plotcp(model_ment_dis_rate_tree, minline = TRUE, lty = 3, col = 1, main="Regression Tree: CP Plot")

#R-squared
#rsq.rpart(model_ment_dis_rate_tree)

Pruned Regression Tree - mental_distress_rate

#Prune Tree - Mental Health Distress Rate 

#Prune Tree Model
prune_model_ment_dis_rate_tree=prune.rpart(model_ment_dis_rate_tree, cp=0.01, best=5)

#Summary of Pruned Tree
summ_table4=summary(prune_model_ment_dis_rate_tree)
#summ_table4$variable.importance
table4_FI_tree<-data.frame(summ_table4$variable.importance)

table4_FI_tree %>%
  kbl(caption="Pruned Variable Importance: Target Y - mental_distress_rate (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Pruned Variable Importance: Target Y - mental_distress_rate (all vars.)
Feature Importance
child_poverty 2.193
median_inc 1.960
year 1.837
college 1.111
food_index 1.098
single_parent 0.981
unempl 0.751
inequality 0.227
mh_providers 0.035
pop_provider_ratio 0.034
region 0.012
hs_grad 0.006
severe_housing 0.001
#Plot Pruned Tree
fancyRpartPlot(prune_model_ment_dis_rate_tree, main = "Pruned Regression Tree: Target Y - mental_distress_rate (all vars.)")

#Print Complexity Penalty of Prune Tree
#printcp(prune_model_ment_dis_rate_tree)

#Plot Complexity Penalty of Prune Tree 
par(mar = c(5, 5, 10, 5))
plotcp(prune_model_ment_dis_rate_tree,minline = TRUE, lty = 3, col = 1, main="Pruned Regression Tree: CP Plot")

#R-squared
#rsq.rpart(prune_model_ment_dis_rate_tree)

Predicting/RMSE for Regression Trees - mental_distress_rate

#Predicting and RMSE for Regression Trees

#Prediction/Testing - Original Model
yhat2=predict(model_ment_dis_rate_tree,newdata=ranked[-train,])
test2=ranked[-train,"mental_distress_rate"] 

plot(yhat2,test2, main="Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

#Prediction/Testing - Pruned Model
yhat_prune2=predict(prune_model_ment_dis_rate_tree,newdata=ranked[-train,])


plot(yhat_prune2,test2, main="Pruned Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

Original Reg. Tree RMSE:

#MSE
#mean((yhat2-test)^2)

#RMSE
sqrt(mean((yhat2-test2)^2))
## [1] 0.0126

Pruned Tree RMSE:

#MSE
#mean((yhat_prune2-test)^2)

#RMSE
sqrt(mean((yhat_prune2-test2)^2))
## [1] 0.0134

Random Forest Models

Finally, we constructed random forest models to further analyze the variable importance and how accurately a different model type can predict the mental health target variables. The first random forest model analyzes the number of poor mental health days and the second random forest model analyzes the frequent mental distress rate. The random forest models use the same independent variables as the regression trees.

The top three most important variables in order of importance by %IncMSE are year, median_inc, and region. The top three most important variables in order of importance by IncNudePurity are year, child_poverty, and median_inc. This occurred for both target mental health variables.

Overall, the random forest models were better predicting models relative to the regression tree models as indicated by their lower RMSE values for each respective target mental health variable.

Random Forest - mental_health_days

#Random Forest - Num. of Mentally Unhealthy Days

#Generate Random Forest Model - Num. of Mentally Unhealthy Days (all vars.)
set.seed(123)
bag_ment_hlth_days=randomForest(mental_health_days ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, data=ranked, subset=train, mtry=13, importance=TRUE)

#bag_ment_hlth_days

#Predict/Test Random Forest Model
yhat.bag1 = predict(bag_ment_hlth_days,newdata=ranked[-train,])
plot(yhat.bag1, test1, main="Random Forest: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

RMSE:

#MSE
#mean((yhat.bag1-test)^2)

#RMSE
sqrt(mean((yhat.bag1-test1)^2))
## [1] 0.327
#Determine Variable Importance
#importance(bag_ment_hlth_days)

#Plot Variable Importance 
varImpPlot(bag_ment_hlth_days, main = "Random Forest Vars. Importance: Target Y - mental_health_days (all vars.)")

Random Forest - mental_distress_rate

#Random Forest - Mental Distress Rate

#Generate Random Forest Model - Mental Health Distress Rate 
set.seed(123)
bag_ment_dis_rate=randomForest(mental_distress_rate ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, data=ranked, subset=train, mtry=13, importance=TRUE)

#bag_ment_dis_rate

#Predict/Test Random Forest Model
yhat.bag2 = predict(bag_ment_dis_rate,newdata=ranked[-train,])
plot(yhat.bag2, test2, main="Random Forest: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

RMSE:

#MSE
#mean((yhat.bag1-test)^2)

#RMSE
sqrt(mean((yhat.bag2-test2)^2))
## [1] 0.00915
#Determine Variable Importance
#importance(bag_ment_dis_rate)

#Plot Variable Importance 
varImpPlot(bag_ment_dis_rate, main = "Random Forest Vars. Importance: Target Y - mental_distress_rate (all vars.)")

RMSE Comparison for All Models

Below is RMSE comparisons between all the models used in this project. Each model compares the target variable mental_health_days and mental_distress_rate against the independent variables inequality, college, hs_grad, unempl, child_poverty, single_parent, severe_housing, food_index, median_inc, mh_providers, pop_provider_ratio, region, and year.

Based on the RMSE values the random forest models are the best predicting models because they have the lowest RMSE values. A low RMSE value indicates that the simulated predicted data is close to the observed test data, indicating better accuracy relative to higher RMSE values.

RMSE Model Comparisons - mental_health_days

#RMSE Model Comparisons - mental_health_days

lm_days_train <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + mh_providers + pop_provider_ratio
                + region + year, data=ranked, subset=train)

yhat_lm_1 = predict(lm_days_train,newdata=ranked[-train,])

#RMSE: Linear Model
a1<-sqrt(mean((yhat_lm_1-test1)^2))

#RMSE: Regression Tree
b1<-sqrt(mean((yhat1-test1)^2))

#RMSE: Pruned Regression Tree
c1<-sqrt(mean((yhat_prune1-test1)^2))

#RMSE: Random Forest
d1<-sqrt(mean((yhat.bag1-test1)^2))

df1 <- data.frame(Models1 <- c("Linear Model", "Reg. Tree", "Pruned Tree", "Random Forest"), RMSE1 <- c(a1, b1, c1, d1) )

df1 %>%
  kbl(caption="RMSE Comparisons - mental_health_days",
       format= "html", digits = 4, col.names = c("Models","RMSE"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
RMSE Comparisons - mental_health_days
Models RMSE
Linear Model 0.390
Reg. Tree 0.427
Pruned Tree 0.435
Random Forest 0.327

RMSE Model Comparisons - mental_distress_rate

lm_dist_train <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + mh_providers + pop_provider_ratio
                + region + year, data=ranked, subset=train)

yhat_lm_2 = predict(lm_dist_train,newdata=ranked[-train,])

#RMSE: Linear Model
a2<-sqrt(mean((yhat_lm_2-test2)^2))

#RMSE: Regression Tree
b2<-sqrt(mean((yhat2-test2)^2))

#RMSE: Pruned Regression Tree
c2<-sqrt(mean((yhat_prune2-test2)^2))

#RMSE: Random Forest
d2<-sqrt(mean((yhat.bag2-test2)^2))

df2 <- data.frame(Models2 <- c("Linear Model", "Reg. Tree", "Pruned Tree", "Random Forest"), RMSE2 <- c(a2, b2, c2, d2) )

df2 %>%
  kbl(caption="RMSE Comparisons - mental_distress_rate",
       format= "html", digits = 4, col.names = c("Models","RMSE"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
RMSE Comparisons - mental_distress_rate
Models RMSE
Linear Model 0.0110
Reg. Tree 0.0126
Pruned Tree 0.0134
Random Forest 0.0091

Conclusion

The EDA and hypothesis testing conducted in the Midterm Project suggested that the relationship between mental health and economic inequality differs by region. Linear modeling indicates that the relationship between mental health and economic inequality is statistically significant. However, the strength of association between mental and economic inequality varies by region. Both measures of mental health exhibited similar relationships, but specifically the number of poor mental health days variable displayed more pronounced associations. The Midwest region had the strongest association between each mental health dependent variable. The South had the weakest association between each mental health dependent variable.

As a result, we draw the following conclusions related to our SMART questions:

  1. The relationship between mental health and income inequality in the U.S. remains statistically significant when including other economic variables.
  2. The relationship differs across U.S. regions, with the association being strongest in the Midwest and weakest in the South.
  3. The relationship remains consistent even when using two different measures for mental health. However, the strength of the association for mental_health_days seems stronger than the correlation for mental_distress_rate.

The regression tree models constructed for both mental health dependent variables indicated that the top three most important variables are year, percentage of child poverty, and median income. Interestingly, for both mental health dependent variables the income inequality variable was considered to be the eighth most important variable. Although this finding does not outright contradict the notion that income inequality plays an important role in determining poor mental health, it must be noted that the income inequality variable does rank lower than expected. This is surprising given that the income inequality variable is the most direct measure of economic inequality as it is a ratio of household income between the 80th and 20th percentiles. One possible explanation for this is that the income inequality variable does not have a lot of variation within the dataset, mitigating its ability to explain the variation in the two mental health dependent variables.

The random forest models constructed produced somewhat similar results regarding variable importance. However, depending on the measure of variable importance, either %IncMSE or IncNodePurity, different variables were ranked within the top three most important. For both random forest models, the top three most important variables measured by %IncMSE include year, median income, and region. For both random forest models, the top three most important variables measured by IncNodePurity include year, percentage of child poverty, and median income. Again, income inequality ranked lower than expected for both random forest models. The income inequality variable was ranked as the eleventh most important variable by both measures of variable importance. A consistent result across all models is the importance of the region variable, indicating that poor mental health differs regionally in the United States. However, it is unclear if these regional differences are due to economic inequality, differences in regional and social attitudes discussing mental health, data collection biases, or some other unknown influence.

Finally, the RMSE model comparisons revealed that the random forest models were the best predicting models due to their low RMSE values relative to the other models. The worst predicting models were the regression trees due to their large RMSE values relative to the other models. The reason for why the regression trees are the worse predicting models is because they are overfitting the data and are unable to make accurate predictions for data different from the train set. The random forest models likely performed better at prediction because they effectively deal with high dimensional data, they are robust to outliers, they are indifferent to non-linear data, they can handle unbalanced data, they have low bias, and they have moderate variance. Overall, random forest models should probably be used to make reliable predictions regarding the prevalence of poor mental health based on various economic variables.

Limitations

Although these conclusions are interesting and insightful, there are severe limitations that curtail the reliability and accuracy of the results described above.

First, the mental health target variables rely on self-reported mental health data. This data cannot be validated with medical records. Further research could be done by using a binary mental health variable in order to generate a confusion matrix, accuracy score, recall score, precision score, and ROC-AUC graphs.

Secondly, there is a strong time influence on the models, as indicated by the year variable consistently being ranked as one of the top most important variables in the regression tree and random forest models. This makes it difficult to determine if economic inequality has a substantial impact on mental health or if these variable are just correlated over time. Further research may want to focus on data within a single time period.

Third, the models could be constructed and tuned better to provide more reliable predictions and lower RMSE values. For example, the regression trees could be reduced in size by choosing a smaller max depth and a more aggressive complexity parameter. Additionally, we could extend the model selection tools we use to select better linear models. Although we calculated the R-squared, adjusted R-squared, Mallow Cp, and BIC during feature selection, we primarily made our ultimate comparisons between models based on the adjusted R-squared and ANOVA tests. We focused on the adjusted R-squared because it penalizes the addition of independent variables after variance explanation has plateaued. However, for future research, we may want to place more focus on Mallow Cp or BIC values respectively. These assist with model selection by providing a measure for model performance that accounts for model complexity.

Overall, this project suggests that our initial hypothesis, mental health and economic inequality are related, was largely on the right track. However, our results also indicate that other predictors might be more important than income inequality when considering the effects on mental health. Even though there are limitations with this project, further analysis is needed on the relationship between mental health and economic inequality.

References

Robert Wood Johnson Foundation (RWJF) (2021). 2021 County Health Rankings National Data. County Health Rankings & Roadmaps. https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation. Accessed: March 13, 2022.

Robert Wood Johnson Foundation (RWJF) (2021). County Health Rankings Model. County Health Rankings & Roadmaps. https://www.countyhealthrankings.org/explore-health-rankings/measures-data-sources/county-health-rankings-model. Accessed: March 13, 2022.

U.S. Census Bureau. (2010). Census Regions and Divisions of the United States. U.S. Department of Commerce Economics and Statistics Administration. https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf. Accessed: March 13, 2022.